[syscall] B: Fix diamond DT_NEEDED handling#28
Closed
esaurez wants to merge 1 commit into
Closed
Conversation
This was referenced Jun 5, 2026
fae4251 to
9495ff0
Compare
03c26ce to
9495ff0
Compare
The dynamic loader walks `DT_NEEDED` recursively, but two bugs broke
diamond dependency graphs of the shape:
libdiamond.so -> libleft.so -> libbase.so
-> libright.so -> libbase.so
both of which only surface once `DT_NEEDED` chains are nontrivial
(the existing dlfcn-needed-c test stays a single-edge linear graph,
so it never exercised either path). Symptoms before this commit:
1. `dlopen(libdiamond.so)` HANGS inside the recursive load.
2. After the hang fix is in place, `dlclose(libdiamond.so)` panics
with `assertion left == right failed: expected to remove exactly
one dynamic library file`.
Fix 1 -- dlopen deadlock on ancestor lock
-----------------------------------------
`load_all_dependencies_recursive` holds the parent library's mutex
while iterating the registry's already-loaded entries. Each entry
gets `.lock()`-ed to compare names. The previous code skipped only
the immediate `new_dlhandle` self; it did not skip ancestors held
by outer recursive frames. Loading libright (depth 1) and then
recursing into libright's deps (depth 2) tried to lock libdiamond
in the inner scan, which deadlocked: libdiamond was held by the
outer frame.
Fix: thread a `BTreeSet<DlHandle>` of ancestor handles down the
recursion and skip every entry whose handle is in that set. The
ancestor is by definition not a candidate for "already loaded by
a sibling" so the skip is functionally safe.
Fix 2 -- dlopen now also handles transitive load races
------------------------------------------------------
Once the deadlock is gone, a second issue surfaces: the pre-loop
`retain` only sees libraries loaded *before* the current frame
started. In a diamond, libbase is loaded by libleft's recursion
*during* libdiamond's frame, after libdiamond's retain has already
run. Without a re-check, the outer loop would then open libbase a
second time (creating two distinct in-memory copies with their own
globals) or trip the `unreachable!()` when the VFS reuses the file
descriptor.
Fix: before each `DynamicLibrary::open` call, re-scan the registry
for a matching name and bind the existing copy if found.
Fix 3 -- dlclose strong_count assertion for diamond
---------------------------------------------------
The BFS unload loop extracts each library from the registry when
`Arc::strong_count == 2` (registry + current pop). For a diamond,
libbase has three Arc references at the time it is first popped
(registry + libleft.dependencies + libright.dependencies), so
`extract_if` correctly returns 0 entries -- but the assertion
demanded exactly 1, panicking the kernel. A later iteration of the
loop pops libright/libleft, drops their dependencies (releasing
the extra libbase ref), pushes libbase again, and the second pop
succeeds at the now-correct refcount.
Fix: replace the strict `assert_eq!(== 1)` with an early `continue`
when `extract_if` returns zero entries, keeping the (now relaxed)
"at most one" assertion to catch any future runtime invariant
violation.
Validation
----------
A companion `dlfcn-diamond-c` test suite
proves both fixes end-to-end on a standalone Nanvix VM:
=== dlfcn diamond DT_NEEDED tests ===
PASS: dlopen(libright.so) [depth-2 chain]
PASS: dlopen(libdiamond.so)
PASS: re-dlopen(libdiamond.so) returns same handle
3 passed, 0 failed
All previously passing dlfcn tests (dlfcn-c, dlfcn-init-runpath-c,
dlfcn-pie-c) still pass; full posix-tests suite passes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bfdd573 to
6f4e0ae
Compare
Owner
Author
|
Superseded by upstream PR nanvix#2478, which carries the same diff (rebased, deep-reviewed, with esaurez/* references stripped). Closing this fork PR; tracking continues upstream. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes two latent bugs in the dynamic loader that prevent it from handling diamond
DT_NEEDEDgraphs — the canonical multi-edge shape that production shared-library chains take:Both bugs only surface once
DT_NEEDEDchains are nontrivial. The existingdlfcn-needed-cposix-test exercises a single linear edge, so it never tripped either path. The new esaurez/posix-testsdlfcn-diamond-csuite covers this case.Bug 1 —
dlopendeadlocks on ancestor mutexload_all_dependencies_recursiveholds the parent library's mutex while iterating the registry's already-loaded entries. Each entry gets.lock()-ed to compare names. The previous code skipped only the immediatenew_dlhandleself — not ancestors held by outer recursive frames. Loadinglibright.so(depth 1) and then recursing into its deps (depth 2) tried to locklibdiamond.soin the inner scan, which deadlocked becauselibdiamond.sowas held by the outer frame.Fix: thread a
BTreeSet<DlHandle>of ancestor handles down the recursion and skip every entry whose handle is in that set. The ancestor is by definition not a candidate for "already loaded by a sibling" so the skip is functionally safe.Bug 2 —
dlopenraces on transitive loads (diamond duplication)Once the deadlock is gone, a second issue surfaces: the pre-loop
retainonly sees libraries loaded before the current frame started. In a diamond,libbase.sois loaded bylibleft.so's recursion duringlibdiamond.so's frame, afterlibdiamond.so'sretainhas already run. Without a re-check, the outer loop would then openlibbase.soa second time, producing two distinct in-memory copies with their own globals (or tripping the existingunreachable!()when the VFS reuses the file descriptor).Fix: before each
DynamicLibrary::opencall, re-scan the registry for a matching name and bind the existing copy if found.Bug 3 —
dlcloseover-strict refcount assertionThe BFS unload loop extracts each library from the registry when
Arc::strong_count == 2(registry + current pop). For a diamond,libbase.sohas threeArcreferences at the time it is first popped (registry +libleft.dependencies+libright.dependencies), soextract_ifcorrectly returns 0 entries — but the assertion demanded exactly 1, panicking the kernel. A later iteration of the loop popslibright/libleft, drops their dependencies (releasing the extralibbaseref), pusheslibbaseagain, and the second pop succeeds at the now-correct refcount.Fix: replace the strict
assert_eq!(== 1)with an earlycontinuewhenextract_ifreturns zero entries, keeping the (now relaxed) "at most one" assertion to catch any future runtime invariant violation.What changes
src/libs/syscall/src/dlfcn/syscall/dlopen.rsancestors: &mut BTreeSet<DlHandle>throughload_all_dependencies_recursive; skip ancestor handles in theretainclosure and the pre-open re-check; add a pre-open registry re-check for transitive diamond loads.src/libs/syscall/src/dlfcn/syscall/dlclose.rsextract_ifreturns 0 entries; relax the assertion message accordingly.Net diff: +88 / −6 lines, no API changes.
Validation
z build all FEATURES=networking LOG_LEVEL=error RELEASE=yes MEMORY_SIZE=256PASSz build format-check,rust-lint-check,spellcheckPASSdlfcn-diamond-csuite — 3/3 PASS on standalone VM:dlfcn-c,dlfcn-init-runpath-c,dlfcn-pie-c) still pass.Companion PR
esaurez/posix-tests
feat/dlfcn-diamond-needed— the test suite this PR fixes. Without this PR, the diamond test hangs in the loader and times out at 120s.Why it matters
This bug blocks the entire
.a→.somigration. Every real-world.sochain — including the upcoming_lxml_etree.cpython-312.so→libxslt.so→libxml2.sochain on top of esaurez/nanvix#27 — eventually produces a diamond when multiple consumers share a common lower-level library. Without this fix,dlopenof any such.sohangs the guest indefinitely.Stacking
Stacked on top of esaurez/nanvix#27. Merge order:
nanvix#27first, then this PR.Downstream consumers
This loader change unblocks the CPython
.a→.somigration and its supporting port-libraries:libxml2.so(bottom of the chain).libxslt.so+libexslt.sowithDT_NEEDED libxml2.so.liblxml_etree.sowith full chain.