Skip to content

[syscall] B: Fix diamond DT_NEEDED handling#28

Closed
esaurez wants to merge 1 commit into
feat/dlfcn-init-array-and-runpathfrom
feat/dlfcn-diamond-needed
Closed

[syscall] B: Fix diamond DT_NEEDED handling#28
esaurez wants to merge 1 commit into
feat/dlfcn-init-array-and-runpathfrom
feat/dlfcn-diamond-needed

Conversation

@esaurez

@esaurez esaurez commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes two latent bugs in the dynamic loader that prevent it from handling diamond DT_NEEDED graphs — the canonical multi-edge shape that production shared-library chains take:

libdiamond.so ─┬─ DT_NEEDED libleft.so  ──┐
               └─ DT_NEEDED libright.so ──┴── DT_NEEDED libbase.so

Both bugs only surface once DT_NEEDED chains are nontrivial. The existing dlfcn-needed-c posix-test exercises a single linear edge, so it never tripped either path. The new esaurez/posix-tests dlfcn-diamond-c suite covers this case.

Bug 1 — dlopen deadlocks on ancestor mutex

load_all_dependencies_recursive holds the parent library's mutex while iterating the registry's already-loaded entries. Each entry gets .lock()-ed to compare names. The previous code skipped only the immediate new_dlhandle self — not ancestors held by outer recursive frames. Loading libright.so (depth 1) and then recursing into its deps (depth 2) tried to lock libdiamond.so in the inner scan, which deadlocked because libdiamond.so was held by the outer frame.

Fix: thread a BTreeSet<DlHandle> of ancestor handles down the recursion and skip every entry whose handle is in that set. The ancestor is by definition not a candidate for "already loaded by a sibling" so the skip is functionally safe.

Bug 2 — dlopen races on transitive loads (diamond duplication)

Once the deadlock is gone, a second issue surfaces: the pre-loop retain only sees libraries loaded before the current frame started. In a diamond, libbase.so is loaded by libleft.so's recursion during libdiamond.so's frame, after libdiamond.so's retain has already run. Without a re-check, the outer loop would then open libbase.so a second time, producing two distinct in-memory copies with their own globals (or tripping the existing unreachable!() when the VFS reuses the file descriptor).

Fix: before each DynamicLibrary::open call, re-scan the registry for a matching name and bind the existing copy if found.

Bug 3 — dlclose over-strict refcount assertion

The BFS unload loop extracts each library from the registry when Arc::strong_count == 2 (registry + current pop). For a diamond, libbase.so has three Arc references at the time it is first popped (registry + libleft.dependencies + libright.dependencies), so extract_if correctly returns 0 entries — but the assertion demanded exactly 1, panicking the kernel. A later iteration of the loop pops libright/libleft, drops their dependencies (releasing the extra libbase ref), pushes libbase again, and the second pop succeeds at the now-correct refcount.

Fix: replace the strict assert_eq!(== 1) with an early continue when extract_if returns zero entries, keeping the (now relaxed) "at most one" assertion to catch any future runtime invariant violation.

What changes

File Change
src/libs/syscall/src/dlfcn/syscall/dlopen.rs Thread ancestors: &mut BTreeSet<DlHandle> through load_all_dependencies_recursive; skip ancestor handles in the retain closure and the pre-open re-check; add a pre-open registry re-check for transitive diamond loads.
src/libs/syscall/src/dlfcn/syscall/dlclose.rs Skip rather than panic when extract_if returns 0 entries; relax the assertion message accordingly.

Net diff: +88 / −6 lines, no API changes.

Validation

  • z build all FEATURES=networking LOG_LEVEL=error RELEASE=yes MEMORY_SIZE=256 PASS
  • z build format-check, rust-lint-check, spellcheck PASS
  • New dlfcn-diamond-c suite — 3/3 PASS on standalone VM:
    === dlfcn diamond DT_NEEDED tests ===
      PASS: dlopen(libright.so) [depth-2 chain]
      PASS: dlopen(libdiamond.so)
      PASS: re-dlopen(libdiamond.so) returns same handle
    3 passed, 0 failed
    
  • All previously passing dlfcn tests (dlfcn-c, dlfcn-init-runpath-c, dlfcn-pie-c) still pass.
  • Full posix-tests suite passes.

Companion PR

esaurez/posix-tests feat/dlfcn-diamond-needed — the test suite this PR fixes. Without this PR, the diamond test hangs in the loader and times out at 120s.

Why it matters

This bug blocks the entire .a.so migration. Every real-world .so chain — including the upcoming _lxml_etree.cpython-312.solibxslt.solibxml2.so chain on top of esaurez/nanvix#27 — eventually produces a diamond when multiple consumers share a common lower-level library. Without this fix, dlopen of any such .so hangs the guest indefinitely.

Stacking

Stacked on top of esaurez/nanvix#27. Merge order: nanvix#27 first, then this PR.

Downstream consumers

This loader change unblocks the CPython .a.so migration and its supporting port-libraries:

The dynamic loader walks `DT_NEEDED` recursively, but two bugs broke
diamond dependency graphs of the shape:

  libdiamond.so -> libleft.so  -> libbase.so
                -> libright.so -> libbase.so

both of which only surface once `DT_NEEDED` chains are nontrivial
(the existing dlfcn-needed-c test stays a single-edge linear graph,
so it never exercised either path). Symptoms before this commit:

  1. `dlopen(libdiamond.so)` HANGS inside the recursive load.
  2. After the hang fix is in place, `dlclose(libdiamond.so)` panics
     with `assertion left == right failed: expected to remove exactly
     one dynamic library file`.

Fix 1 -- dlopen deadlock on ancestor lock
-----------------------------------------

`load_all_dependencies_recursive` holds the parent library's mutex
while iterating the registry's already-loaded entries. Each entry
gets `.lock()`-ed to compare names. The previous code skipped only
the immediate `new_dlhandle` self; it did not skip ancestors held
by outer recursive frames. Loading libright (depth 1) and then
recursing into libright's deps (depth 2) tried to lock libdiamond
in the inner scan, which deadlocked: libdiamond was held by the
outer frame.

Fix: thread a `BTreeSet<DlHandle>` of ancestor handles down the
recursion and skip every entry whose handle is in that set. The
ancestor is by definition not a candidate for "already loaded by
a sibling" so the skip is functionally safe.

Fix 2 -- dlopen now also handles transitive load races
------------------------------------------------------

Once the deadlock is gone, a second issue surfaces: the pre-loop
`retain` only sees libraries loaded *before* the current frame
started. In a diamond, libbase is loaded by libleft's recursion
*during* libdiamond's frame, after libdiamond's retain has already
run. Without a re-check, the outer loop would then open libbase a
second time (creating two distinct in-memory copies with their own
globals) or trip the `unreachable!()` when the VFS reuses the file
descriptor.

Fix: before each `DynamicLibrary::open` call, re-scan the registry
for a matching name and bind the existing copy if found.

Fix 3 -- dlclose strong_count assertion for diamond
---------------------------------------------------

The BFS unload loop extracts each library from the registry when
`Arc::strong_count == 2` (registry + current pop). For a diamond,
libbase has three Arc references at the time it is first popped
(registry + libleft.dependencies + libright.dependencies), so
`extract_if` correctly returns 0 entries -- but the assertion
demanded exactly 1, panicking the kernel. A later iteration of the
loop pops libright/libleft, drops their dependencies (releasing
the extra libbase ref), pushes libbase again, and the second pop
succeeds at the now-correct refcount.

Fix: replace the strict `assert_eq!(== 1)` with an early `continue`
when `extract_if` returns zero entries, keeping the (now relaxed)
"at most one" assertion to catch any future runtime invariant
violation.

Validation
----------

A companion `dlfcn-diamond-c` test suite
proves both fixes end-to-end on a standalone Nanvix VM:

  === dlfcn diamond DT_NEEDED tests ===
    PASS: dlopen(libright.so) [depth-2 chain]
    PASS: dlopen(libdiamond.so)
    PASS: re-dlopen(libdiamond.so) returns same handle
  3 passed, 0 failed

All previously passing dlfcn tests (dlfcn-c, dlfcn-init-runpath-c,
dlfcn-pie-c) still pass; full posix-tests suite passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@esaurez esaurez force-pushed the feat/dlfcn-diamond-needed branch from bfdd573 to 6f4e0ae Compare June 9, 2026 01:52
@esaurez

esaurez commented Jun 9, 2026

Copy link
Copy Markdown
Owner Author

Superseded by upstream PR nanvix#2478, which carries the same diff (rebased, deep-reviewed, with esaurez/* references stripped). Closing this fork PR; tracking continues upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant