Skip to content

Conversation

@jeffreyssmith2nd
Copy link
Contributor

@jeffreyssmith2nd jeffreyssmith2nd commented Jan 6, 2025

Which issue does this PR close?

Closes #14036

Rationale for this change

What changes are included in this PR?

Return Pending when a new File is opened in the FileStream to yield control back to Tokio. This should help prevent queries from running after cancellation.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jan 6, 2025
@jeffreyssmith2nd jeffreyssmith2nd marked this pull request as ready for review January 7, 2025 15:51
Comment on lines +485 to +486
cx.waker().wake_by_ref();
return Poll::Pending;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may not be correct. The wake_by_ref should be called AFTER returning Pending. I think you may be lucky that this works, but it's undefined behavior. If I understand the yield_now impl. correctly, it tries to delay the "wake" call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was odd too (I had seen this pattern elsewhere) but looking at the implementation of yield_now, when you poll it does a context::defer and then returns Pending (source). context::defer will directly call wake_by_ref if called from outside of the runtime (source).

I could be misunderstanding what "outside the runtime" means in this case/how that will actually interact with the yield.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the multi-thread RT (which is the only thing we really care about, I think it's NOT instantly calling wake_by_ref instantly:

  1. yield_now: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/task/yield_now.rs#L57
  2. context::defer: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L166-L177
  3. with_scheduler: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L183-L195
  4. (via some indirection) Context::defer: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/mod.rs#L287-L289
  5. (via some indirection) Context::defer (different Context this time): https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L766-L768
  6. Defer::defer: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/defer.rs#L15-L26
  7. The deferred threads are woken in two places (1, 2) for which the 2nd one seems relevant. The actual waking will happen when the worker is -- I think -- unparked after a "park timeout". This is definitely long after Pending is returned, because the thread that is returning Pending is the very same worker, i.e. it cannot possibly be parked at this point in time.

Copy link
Contributor

@alamb alamb Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be possible to call yield_now directly rather than trying to insert waker manipulation directly into the file opener

Like could we do something like

let future = tokio::task::yield_now()

And return future.poll_next() 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may not be correct. The wake_by_ref should be called AFTER returning Pending. I think you may be lucky that this works, but it's undefined behavior. If I understand the yield_now impl. correctly, it tries to delay the "wake" call.

I was thinking similarly, but those have changed my mind:
https://github.com/rust-lang/rust/blob/1f81f906893d166d05fb4839f169983f2b564cc7/library/core/src/task/wake.rs#L423-L426

and this example:
https://github.com/rust-lang/rust/blob/1f81f906893d166d05fb4839f169983f2b564cc7/library/core/src/task/wake.rs#L703-L704

I believe the usage is not a problem, but yielding after being ready to open the next file seems not the correct solution to me. Does this also start to degrade the throughput for small files? Do we have any example of this in the codebase (returning pending not because of an IO)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may not be correct. The wake_by_ref should be called AFTER returning Pending. I think you may be lucky that this works, but it's undefined behavior. If I understand the yield_now impl. correctly, it tries to delay the "wake" call.

I was thinking similarly, but those have changed my mind: rust-lang/rust@1f81f90/library/core/src/task/wake.rs#L423-L426

and this example: rust-lang/rust@1f81f90/library/core/src/task/wake.rs#L703-L704

good point, thanks

Do we have any example of this in the codebase (returning pending not because of an IO)?

We do, see #5299 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its clearer, we can remove these lines and change the lines above it to

self.state = FileStreamState::Open {
  future: Box::pin(async {
    yield_now().await;
    reader
  }),
  partition_values,
};

This works because this future gets polled in the next iteration of the loop when we transition to the Open state.

crepererum
crepererum previously approved these changes Jan 8, 2025
Copy link
Contributor

@wiedld wiedld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a test case like this one be useful?
influxdata@b9e4d8b

It does capture the difference with, versus without, this change.

@ozankabak
Copy link
Contributor

I am not sure this is the right solution to the problem. @berkaysynnada will be looking deeper into why this happens as he gathers more information from @jeffreyssmith2nd. Let's not merge this PR until then. Thanks

@alamb
Copy link
Contributor

alamb commented Jan 9, 2025

Would a test case like this one be useful? influxdata@b9e4d8b

It does capture the difference with, versus without, this change.

Thank you @wiedld -- I think this type of test is basically testing the implementation (eg. testing that yielding is happening). However the behavior we care about here is that the stream stops processing when dropped (aka canceled)

Thus I agree with @ozankabak that we should get some sort of higher level reproducer to be sure we have fixed the root cause (rather than just treating the symptom)

@alamb alamb dismissed crepererum’s stale review January 27, 2025 13:48

Dismissing review so we don't accidentally merge this by accident

@alamb alamb marked this pull request as draft January 27, 2025 13:48
@alamb
Copy link
Contributor

alamb commented Jan 27, 2025

Marking as draft to clean up the review queue. I believe this PR is waiting on a test / reproducer

alamb pushed a commit that referenced this pull request Mar 14, 2025
* Document guidelines for physical operator yielding

To start a policy of the behavior physical operator streams should have
and drive improvements in this area to allow for timely cancellation.

Connects to #14036 and related to pull requests such as #14028.

* Remove discussion of tokio coop

* Move rationale up

* TODO ADD LINK

* Say block the CPU rather than pin the CPU

* Add a caveat to use the right tool for the situation

* Improve documentation of yield guidelines

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Fix newlines and whitespace in comment

* Add a link to the cancellation benchmark documented in the README

* Fix newlines in benchmarks README

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>
@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Mar 29, 2025
@github-actions github-actions bot closed this Apr 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate Stale PR has not had any activity for some time

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Un-cancellable Query when hitting many large files.

6 participants