feat: prototype allocator-level OOM circuit breaker (OomGuard)#4582
Draft
andygrove wants to merge 15 commits into
    // exceed isize::MAX on any real platform, so no wrapping or overflow occurs.
    let old = layout.size() as isize;
    let new = new_size as isize;
    track(new - old);
Just fixed this bug in DF. You need to panic before the realloc, otherwise the caller still has the old pointer and tries to free it on unwind and segfaults.
…ard [skip ci]
Account for and enforce the size delta before delegating to the inner
realloc. Panicking after inner.realloc is unsound: realloc may have freed
or moved the old block, leaving the caller to free a dangling old pointer
on unwind and segfault. Enforce while the old pointer is still valid.
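The ordering described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Accounting` type, its `track_and_enforce` helper, and the fixed `limit` field are hypothetical names, and a plain `panic!` stands in for the guard panic. Note that the standard library documentation currently states that unwinding out of a registered global allocator is undefined behavior, so the sketch is meant to be called directly rather than installed with `#[global_allocator]`.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicIsize, Ordering};

/// Hypothetical byte-tracking wrapper (illustrative names only).
pub struct Accounting<A> {
    inner: A,
    balance: AtomicIsize,
    limit: isize,
}

impl<A> Accounting<A> {
    pub const fn new(inner: A, limit: isize) -> Self {
        Self { inner, balance: AtomicIsize::new(0), limit }
    }

    pub fn balance(&self) -> isize {
        self.balance.load(Ordering::Relaxed)
    }

    /// Apply `delta` to the balance. If a growth would exceed the limit,
    /// roll the delta back and panic. No memory has been touched yet.
    fn track_and_enforce(&self, delta: isize) {
        let after = self.balance.fetch_add(delta, Ordering::Relaxed) + delta;
        if delta > 0 && after > self.limit {
            self.balance.fetch_sub(delta, Ordering::Relaxed);
            panic!("over budget: {after} > {}", self.limit);
        }
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for Accounting<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.track_and_enforce(layout.size() as isize);
        unsafe { self.inner.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        self.balance.fetch_sub(layout.size() as isize, Ordering::Relaxed);
    }

    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        let delta = new_size as isize - layout.size() as isize;
        // Enforce BEFORE delegating: if this panics, `ptr` is still a live
        // allocation and the caller can free it safely during unwind.
        self.track_and_enforce(delta);
        let new_ptr = unsafe { self.inner.realloc(ptr, layout, new_size) };
        if new_ptr.is_null() {
            // The inner realloc failed; the old block is unchanged, so undo the delta.
            self.balance.fetch_sub(delta, Ordering::Relaxed);
        }
        new_ptr
    }
}
```

Because the check runs first, a rejected `realloc` leaves both the balance and the old allocation exactly as they were.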
Gate panic_any behind a compare_exchange on ARMED so at most one thread
fires the guard panic per arm cycle. The relaxed ARMED load on the hot
path is not a serialization point: several threads can read ARMED=true in
the same window and each dispatch a panic, which Rust's unwind ABI can
turn into a process abort ("failed to initiate panic", exit 133). The
guard re-arms on the next createPlan.
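A minimal sketch of the single-winner gate described above, under stated assumptions: the `Guard` type, `try_fire`, and the `race` helper are hypothetical names for illustration, and `try_fire` returns a `bool` where the real guard would call `panic_any`. The relaxed load is only a cheap pre-check; the `compare_exchange` is what decides the winner.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Hypothetical stand-in for the ARMED flag described in the commit message.
pub struct Guard {
    armed: AtomicBool,
}

impl Guard {
    pub const fn new() -> Self {
        Self { armed: AtomicBool::new(false) }
    }

    /// Re-arm the guard (the commit message does this on the next createPlan).
    pub fn arm(&self) {
        self.armed.store(true, Ordering::Release);
    }

    /// Returns true for at most one caller per arm cycle.
    pub fn try_fire(&self) -> bool {
        // Hot-path pre-check: many threads may observe `true` here.
        if !self.armed.load(Ordering::Relaxed) {
            return false;
        }
        // Serialization point: only one thread flips true -> false.
        self.armed
            .compare_exchange(true, false, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
    }
}

/// Release `n` threads at once and count how many win the gate.
pub fn race(guard: Arc<Guard>, n: usize) -> usize {
    let barrier = Arc::new(Barrier::new(n));
    let winners = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let (g, b, w) = (guard.clone(), barrier.clone(), winners.clone());
            thread::spawn(move || {
                b.wait();
                if g.try_fire() {
                    // The real guard would raise its panic here.
                    w.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    winners.load(Ordering::Relaxed)
}
```

With this shape, threads that lose the exchange simply continue; only the winner would dispatch the panic until the guard is armed again.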
Which issue does this PR close?
Relates to #4576. This is an exploratory prototype of the RSS circuit breaker ("OomGuard") half of that issue, not a complete implementation, so it does not close it.
Rationale for this change
Comet's memory accounting relies on voluntary `MemoryPool` reservations, which miss allocations made by Arrow buffers, join scratch space, and expression kernels. When real native memory exceeds the container limit, the OS/YARN/Kubernetes kills the entire executor JVM, losing every task and all cached data on it.
This prototype adds an executor-global, allocator-level circuit breaker. It tracks the real bytes the global allocator hands out and, when an armed, over-budget condition is detected on a query-worker thread, fails that single task with a retriable `ResourcesExhausted` error instead of letting the executor get OOM-killed. The approach adapts the `AccountingAllocator` from apache/datafusion#22626 (the byte-tracking allocator wrapper) rather than depending on it, since that code lives in DataFusion's test-only `sqllogictest` crate.
What changes are included in this PR?
Gated behind a new `oom-guard` cargo feature; the default build is unchanged with zero added per-allocation overhead.
- `native/core/src/execution/memory_pools/oom_guard.rs` (new): `AccountingAllocator<A>` wrapping the inner global allocator; a single process-wide balance with per-thread drift settled at a 64 KiB threshold; `arm`/`disarm`/`stamp_current_thread`/`current_balance`; a typed `OomGuardPanic`, raised via `panic_any` on an armed, stamped thread that crosses the limit, with reentrancy protection so the panic's own boxing allocation does not recurse.
- `native/core/src/lib.rs`: under the `oom-guard` feature, installs the wrapper as `#[global_allocator]` over jemalloc / mimalloc / system; mutually exclusive cfgs leave the default build untouched.
- `native/core/src/execution/jni_api.rs`: stamps tokio worker threads (`on_thread_start`) and the JNI caller thread; arms the guard from config in `createPlan`; maps `OomGuardPanic` to `DataFusionError::ResourcesExhausted` at both execution boundaries (the spawned/channel path, both producer and consumer, and the busy-poll `block_on` path).
- `spark/src/main/scala/org/apache/comet/CometConf.scala`: registers `spark.comet.exec.memoryGuard.enabled` (default false) and `spark.comet.exec.memoryGuard.size` (optional; defaults to the executor off-heap size).
Known limitations / out of scope for this prototype (candidates for follow-ups):
- `spawn_blocking`/IO/other pools are tracked but cannot themselves trip the breaker.
- […] `MemoryPool` (the "online accounting" half of Investigate adopting DataFusion's allocator-level memory accounting to replace manual memory tuning #4576).
- […] `createPlan`.
Note: commits carry `[skip ci]` intentionally while this is an early prototype.
How are these changes tested?
- […] `oom_guard.rs` cover the decision/settle helpers, and that the breaker trips only on an armed, stamped thread (disarmed never trips, unstamped never trips).
- […] `AccountingAllocator` and asserts an `OomGuardPanic` is raised and caught.
- […] `clippy -D warnings` across the default, `oom-guard`, and `jemalloc,oom-guard` feature combinations.
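For readers unfamiliar with the pieces these tests exercise, here is a rough sketch of what a settle helper, a trip decision, and a typed-panic boundary might look like. Everything below is an assumption made for illustration: the function names `settle`, `should_trip`, and `run_guarded`, the fields on `OomGuardPanic`, and the `String` error (standing in for `DataFusionError::ResourcesExhausted`) are not taken from the PR. Only the 64 KiB threshold and the "armed and stamped" rule come from the description.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Per-thread drift is folded into the global balance at this size (64 KiB).
const SETTLE_THRESHOLD: isize = 64 * 1024;

/// Add `delta` to a thread's local drift. Once the drift magnitude reaches
/// the threshold, return the amount to apply to the process-wide balance
/// and reset the local drift to zero.
pub fn settle(drift: &mut isize, delta: isize) -> Option<isize> {
    *drift += delta;
    if (*drift).abs() >= SETTLE_THRESHOLD {
        Some(std::mem::take(drift))
    } else {
        None
    }
}

/// The breaker trips only on an armed guard, a stamped thread, and a
/// balance above the limit.
pub fn should_trip(armed: bool, stamped: bool, balance: isize, limit: isize) -> bool {
    armed && stamped && balance > limit
}

/// Typed panic payload (fields are hypothetical).
#[derive(Debug)]
pub struct OomGuardPanic {
    pub balance: isize,
    pub limit: isize,
}

/// Run `f`, turning an `OomGuardPanic` into an error value and letting
/// every other panic continue to unwind.
pub fn run_guarded<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(value) => Ok(value),
        Err(payload) => match payload.downcast::<OomGuardPanic>() {
            Ok(p) => Err(format!("resources exhausted: {} > {}", p.balance, p.limit)),
            Err(other) => panic::resume_unwind(other),
        },
    }
}
```

Downcasting the payload is what keeps the boundary narrow: only the guard's own panic type is converted into a retriable error, while unrelated panics propagate unchanged.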