[syscall] E: Honour STB_WEAK undefined symbols#22

Open
esaurez wants to merge 1 commit into dev from feat/dlfcn-stb-weak-support

@esaurez esaurez commented May 28, 2026

Summary

Teaches the in-process Nanvix dynamic loader to honour the System V ABI's STB_WEAK contract: an undefined weak symbol that cannot be resolved at dynamic-link time resolves to value 0, not to a loader error. This matches every mainstream ELF loader and is a prerequisite for loading .so files produced by GCC/Clang: the C++ runtime, TLS bootstrap, and many GNU extensions all carry weak refs that may or may not be present at load time.

Why this is required

The current loader rejects any .so containing any undefined weak symbol that no loaded module defines. In practice that breaks:

  • libstdc++: emits weak refs to pthread_* (so that statically linking it into a single-threaded program still works), __gmon_start__, and several _ITM_* symbols. When the consumer of a libstdc++ .so is itself a static binary, none of those are defined anywhere. The spec expects them to fold to 0; the loader expected them to be resolvable.
  • TLS bootstrap: GCC emits weak refs to __cxa_thread_atexit_impl and friends so that the C++ runtime can optionally hook into glibc's per-thread cleanup machinery. On Nanvix there is no glibc, so these stay undefined; the program is designed to fall through, but only if the loader complies with the ABI.
  • CPython + numpy: the numpy .so (and most pip-installed wheels) ship weak refs to glibc-isms via libstdc++. These prevented us from running anything beyond pure-Python code on Nanvix and forced the toolchain wrapper to pass -Wl,--allow-shlib-undefined as a workaround, which papered over weak and strong undefined refs alike — defeating the linker's diagnostic value.

After this patch, .sos that exercise any of the above just work, and the toolchain wrapper can stop hiding link-time errors.

What changed

| File | Change |
| --- | --- |
| src/libs/elf/src/elf32.rs | Add STB_LOCAL / STB_GLOBAL / STB_WEAK constants, ST_BIND_SHIFT, and Elf32Sym::st_bind() accessor. |
| src/libs/elf/src/relocation.rs | Add typed SymbolBinding enum, Symbol::binding(), Symbol::is_weak(). Unknown bindings fall through to Other so they are never silently treated as weak. Five unit tests cover the helper. |
| src/libs/syscall/src/dlfcn/syscall/dynlib.rs | (1) In get_symbol_value(), when lookup() == None, return Ok(0) if the referring symbol is SHN_UNDEF and STB_WEAK. Strong undefined refs still error. (2) In query(), skip SHN_UNDEF entries before consulting get_symbol_value() so dladdr() does not report ghost symbols at address 0. |
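
For readers unfamiliar with the `st_info` layout, the binding decode described in the first two rows can be sketched roughly as follows. This is an illustrative approximation, not the code in this PR: the free functions `st_bind`, `binding`, and `is_weak` taking a raw `u8` are assumptions made to keep the example self-contained (the patch exposes them as methods on `Elf32Sym` and `Symbol`), and the payload-free `Other` variant is a guess at the enum's shape. The numeric constants are the gABI values.

```rust
/// Binding values from the ELF gABI; the binding is the upper nibble of `st_info`.
pub const STB_LOCAL: u8 = 0;
pub const STB_GLOBAL: u8 = 1;
pub const STB_WEAK: u8 = 2;
/// Right shift that isolates the binding nibble.
pub const ST_BIND_SHIFT: u8 = 4;

/// Typed view of the binding nibble (variant shape is assumed for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolBinding {
    Local,
    Global,
    Weak,
    /// Reserved, OS-specific, or processor-specific values; never treated as weak.
    Other,
}

/// Extracts the raw binding from an `st_info` byte, ignoring the type nibble.
pub fn st_bind(st_info: u8) -> u8 {
    st_info >> ST_BIND_SHIFT
}

/// Lifts the raw binding into the typed enum.
pub fn binding(st_info: u8) -> SymbolBinding {
    match st_bind(st_info) {
        STB_LOCAL => SymbolBinding::Local,
        STB_GLOBAL => SymbolBinding::Global,
        STB_WEAK => SymbolBinding::Weak,
        _ => SymbolBinding::Other,
    }
}

/// True only for `STB_WEAK`; unknown bindings map to `Other` and return false.
pub fn is_weak(st_info: u8) -> bool {
    binding(st_info) == SymbolBinding::Weak
}
```

Because the match has an explicit catch-all arm, a reserved binding such as `STB_LOOS` (10) can never be mistaken for a weak symbol.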

Substituting zero is safe across the relocation types the loader currently implements (R_386_32, R_386_PC32, R_386_JMP_SLOT, R_386_GLOB_DAT): the resulting GOT / PLT entry or absolute 32-bit slot is null (for a zero addend), and a PC-relative branch resolves to target address 0, so any code path that actually dereferences the symbol traps deterministically. This matches the contract the spec puts on the program (it must null-check before use).
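
The arithmetic behind that claim can be checked with a small model of the four relocation formulas from the i386 psABI (`S` = symbol value, `A` = addend, `P` = address of the patched field). This is a sketch under assumptions: `RelocKind`, `relocate`, and `call_target` are hypothetical names for illustration and do not mirror the loader's actual types. Note that for the PC-relative case the stored displacement is not literally zero, but the branch target it encodes is.

```rust
/// The relocation kinds named above (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelocKind {
    Abs32,   // R_386_32:       S + A
    Pc32,    // R_386_PC32:     S + A - P
    GlobDat, // R_386_GLOB_DAT: S
    JmpSlot, // R_386_JMP_SLOT: S
}

/// Computes the 32-bit value written at the relocation site.
pub fn relocate(kind: RelocKind, s: u32, a: u32, p: u32) -> u32 {
    match kind {
        RelocKind::Abs32 => s.wrapping_add(a),
        RelocKind::Pc32 => s.wrapping_add(a).wrapping_sub(p),
        RelocKind::GlobDat | RelocKind::JmpSlot => s,
    }
}

/// Effective target of a `call rel32` whose 4-byte displacement field lives at `p`:
/// the CPU adds the displacement to the address of the next instruction (`p + 4`).
pub fn call_target(p: u32, disp: u32) -> u32 {
    p.wrapping_add(4).wrapping_add(disp)
}
```

With `S = 0`, the GOT and PLT slots are exactly null, and a `call` relocated with the usual addend of `-4` lands on address 0 regardless of where the call site is.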

Specification

System V ABI ("gABI"), chapter Symbol Table:

Weak symbols resemble global symbols, but their definitions have lower precedence. [...] Unresolved weak symbols have a zero value.

Precedent in mainstream loaders

Every well-known ELF dynamic loader implements this rule. The contract is uniform; only the surrounding code style varies.

  • glibc (elf/dl-lookup.c, _dl_lookup_symbol_x): after iterating every scope, if current_value.s == NULL, glibc inspects the reference symbol's binding. If it is STB_WEAK, glibc fills the lookup result with a synthesized sym_val { 0, 0 } and proceeds; otherwise it raises an unresolved-symbol error. The i386 machine-specific reloc handler sysdeps/i386/dl-machine.h writes the resulting zero through R_386_GLOB_DAT / R_386_JMP_SLOT like any other relocation.
  • musl (ldso/dynlink.c, find_sym2 and the relocation loop in do_relocs): find_sym2 returns a sentinel with .sym = 0 on lookup failure; the relocation loop then checks ELF_ST_BIND(sym->st_info) == STB_WEAK and skips the error path entirely when the binding is weak.
  • FreeBSD rtld-elf (libexec/rtld-elf/rtld.c): find_symdef() returns a sym_zero placeholder for unresolved weak references, populated at rtld startup as a zero-valued symbol. All architecture-specific reloc handlers (reloc.c) treat that placeholder as a normal zero-valued definition.
  • Android Bionic (linker/linker_relocate.cpp): lookup_symbol() returns a null sym for unresolved refs, and the relocation loop branches on is_weak. Weak unresolved refs proceed with sym_addr = 0; strong unresolved refs report an error via DL_ERR.

The behaviour we now implement is the intersection of all four: unresolved weak undef → value 0; unresolved strong undef → loader error. Nothing more, nothing less.
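
That two-way split can be written down as a minimal sketch. The names below (`SymRef`, `LoadError`, `resolve`, and the stand-in `main_exe_lookup`) are hypothetical and exist only to make the example self-contained; the real `get_symbol_value()` has its own signature and error type, which this PR does not change.

```rust
pub const SHN_UNDEF: u16 = 0;
pub const STB_WEAK: u8 = 2;

/// Hypothetical error type standing in for the loader's "symbol not found" error.
#[derive(Debug, PartialEq, Eq)]
pub enum LoadError {
    SymbolNotFound(String),
}

/// The referring symbol as seen in the module being relocated.
pub struct SymRef<'a> {
    pub name: &'a str,
    pub st_info: u8,
    pub st_shndx: u16,
}

/// Resolves `sym` through `lookup`; only an undefined weak reference may fold to 0.
pub fn resolve(
    sym: &SymRef,
    lookup: impl Fn(&str) -> Option<u32>,
) -> Result<u32, LoadError> {
    match lookup(sym.name) {
        Some(addr) => Ok(addr),
        // Unresolved, undefined, and weak: the spec-defined zero substitution.
        None if sym.st_shndx == SHN_UNDEF && (sym.st_info >> 4) == STB_WEAK => Ok(0),
        // Everything else keeps the pre-existing error path.
        None => Err(LoadError::SymbolNotFound(sym.name.to_string())),
    }
}

/// Stand-in for the global scope: only `main_callback` is defined (address is made up).
pub fn main_exe_lookup(name: &str) -> Option<u32> {
    if name == "main_callback" { Some(0x1000) } else { None }
}
```

A resolved weak reference still takes the definition's address; the zero substitution applies only when the lookup comes back empty.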

Validation

  • In-tree unit tests in src/libs/elf/src/relocation.rs::tests cover binding decoding, the Other fallback for unknown bindings, the independence of the binding nibble from the type nibble, and that is_weak() and is_undefined() are orthogonal predicates.

  • dlfcn-weak-c regression suite (proposed at posix-tests#1, "[dlfcn] E: Add dlfcn-weak-c tests for STB_WEAK loader semantics") drives the entire observable contract through the public dlfcn API:

    | # | Fixture | Reloc | Symbol | Expected |
    | --- | --- | --- | --- | --- |
    | 5 | libstrong-missing.so | R_386_JUMP_SLOT | strong_missing | dlopen(RTLD_NOW) fails |
    | 1 | libweak-func-resolved.so | R_386_GLOB_DAT | main_callback | resolves to main exe |
    | 2 | libweak-func-missing.so | R_386_GLOB_DAT | missing_callback | &fn == NULL |
    | 3 | libweak-data-resolved.so | R_386_GLOB_DAT | weak_data | resolves to main exe |
    | 4 | libweak-data-missing.so | R_386_GLOB_DAT | missing_weak_data | &data == NULL |
    | 6 | libweak-plt-resolved.so | R_386_JUMP_SLOT | main_callback | resolves to main exe |
    | 7 | libweak-plt-missing.so | R_386_JUMP_SLOT | missing_plt_callback | dlopen(RTLD_NOW) succeeds (no call) |

    Case 5 — the strong-undefined regression guard — runs first and passes both before and after this patch, confirming that strong undef behaviour is unchanged. Cases 1-4 and 6-7 only pass with this patch applied. All 7 pass end-to-end on the Nanvix microvm against this branch.

  • CPython 3.12 + numpy 1.26.4 end-to-end: cpython links against libstdc++ (weak pthread_* etc. left unresolved at .so-load time), runs hello.py, then import numpy; numpy.array(...).sum() via dlopen() of the numpy .so family; the test harness prints NUMPY_TEST_OK.

Compatibility

  • No public API change. SymbolBinding is new but does not displace any existing helper.
  • The pre-existing strong-undef error path is preserved verbatim; only the previously-uniform error is split into "weak undef → 0" and "strong undef → error".
  • No change to relocation semantics for resolved symbols.
  • dladdr() behaviour improves (no more ghost zero-address symbols from undefined dynsym entries).
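
To illustrate the `dladdr()` point, here is a rough model of a nearest-preceding-symbol query that filters out undefined entries first. It is only a sketch: `DynSym` and this `query` signature are assumptions for the example, and the real `query()` consults `get_symbol_value()` rather than adding a load base directly.

```rust
pub const SHN_UNDEF: u16 = 0;

/// Simplified dynsym entry (hypothetical layout for this example).
pub struct DynSym<'a> {
    pub name: &'a str,
    pub value: u32,
    pub shndx: u16,
}

/// Returns the defined symbol with the greatest address not above `addr`.
/// Undefined entries are dropped up front: an unresolved weak one would
/// otherwise resolve to 0 and show up as a ghost candidate.
pub fn query<'a>(
    symbols: &[DynSym<'a>],
    load_base: u32,
    addr: u32,
) -> Option<(&'a str, u32)> {
    symbols
        .iter()
        .filter(|s| s.shndx != SHN_UNDEF)
        .map(|s| (s.name, load_base.wrapping_add(s.value)))
        .filter(|&(_, start)| start <= addr)
        .max_by_key(|&(_, start)| start)
}
```

An address that precedes every defined symbol now yields no match, instead of being attributed to an undefined weak entry.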

Per the System V ABI generic ABI (gABI, chapter "Symbol Table"), an
undefined symbol whose binding is `STB_WEAK` and which cannot be
resolved at dynamic-link time is silently taken to have the value zero
(or `NULL` for function symbols).  Every mainstream ELF dynamic
loader -- glibc `elf/dl-lookup.c`, musl `ldso/dynlink.c`, FreeBSD
`libexec/rtld-elf/rtld.c`, Android Bionic
`linker/linker_relocate.cpp` -- implements this contract; the program
is responsible for null-checking the symbol before use.

The Nanvix in-process loader's `get_symbol_value()` previously
returned `Err(BadFile, "symbol not found")` unconditionally when a
referenced symbol could not be resolved, regardless of binding.  As a
result, any `.so` containing a weak undefined reference -- which is
extremely common: libstdc++ keeps weak refs to `pthread_*` and
`__gmon_start__`, glibc-compatible compilers emit weak refs to
`__cxa_thread_atexit_impl` and TLS descriptors, and Rust crates often
expose optional integration hooks as weak symbols -- failed to
`dlopen()`, even though the program would have run correctly with the
spec-defined zero substitution.

This patch fills that gap:

  - `src/libs/elf/src/elf32.rs` adds the `STB_LOCAL` / `STB_GLOBAL` /
    `STB_WEAK` constants, the `ST_BIND_SHIFT` constant, and an
    `Elf32Sym::st_bind()` accessor that extracts the binding nibble.

  - `src/libs/elf/src/relocation.rs` lifts the raw `STB_*` values into
    a typed `SymbolBinding` enum (`Local` / `Global` / `Weak` /
    `Other`), wires `Symbol::binding()` through the goblin
    `st_bind()` helper, and adds an `is_weak()` predicate.  Unknown
    or reserved binding values fall through to `SymbolBinding::Other`
    so they are never silently treated as weak.  Five unit tests
    cover decoding, the `Other` fallback, the independence of the
    binding nibble from the type nibble, and that `is_weak()` /
    `is_undefined()` are orthogonal axes.

  - `src/libs/syscall/src/dlfcn/syscall/dynlib.rs::get_symbol_value()`
    intercepts the existing `lookup() == None` arm: if the referring
    symbol is both `SHN_UNDEF` and `STB_WEAK`, return `Ok(0)` and log
    at `debug!`.  Strong undefined references continue to return the
    pre-existing `BadFile` error, so `dlopen(RTLD_NOW)` still fails
    on a genuine missing strong dependency.

  - `src/libs/syscall/src/dlfcn/syscall/dynlib.rs::query()` skips
    undefined entries before consulting `get_symbol_value()`.  Without
    this, `dladdr()` would report a ghost symbol at address `0` for
    every weak undefined entry in the dynsym; that is correct
    behaviour for relocation but is not what `dladdr` is meant to
    surface.

Substituting zero is safe across the relocation types the loader
currently implements (`R_386_32`, `R_386_PC32`, `R_386_JMP_SLOT`,
`R_386_GLOB_DAT`): the resulting GOT / PLT entry or absolute 32-bit
slot is null (for a zero addend), and a PC-relative branch resolves
to target address 0, so any code path that actually dereferences the
symbol traps deterministically -- matching the contract the spec puts
on the program (it must null-check before use).

Validated end-to-end on the Nanvix microvm:

  - The new `dlfcn-weak-c` regression suite (proposed in
    nanvix/posix-tests, see esaurez/posix-tests#1) exercises three
    weak-undefined relocation classes (GOT function, GOT data, and
    PLT function) in both resolved-via-main-exe and missing variants,
    plus a strong-undefined regression guard.  All 7 cases pass
    against this branch.  The strong-undefined case still fails
    `dlopen(RTLD_NOW)`, preserving the existing contract.

  - CPython 3.12 successfully links against libstdc++ with weak refs
    to `pthread_*` etc. left unresolved at .so-load time, runs
    `hello.py`, then `dlopen()`s and exercises numpy 1.26.4 to
    produce `NUMPY_TEST_OK` on the guest.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ppenna pushed a commit to nanvix/cpython that referenced this pull request Jun 8, 2026
Updates `Makefile.nanvix` so that `python.elf` correctly serves as the
"main module" against which extension `.so`s (numpy, ssl, lxml, future
pip-installed wheels, ...) resolve their C and C++ runtime symbols at
dlopen() time. This is the consumer-side companion to the Nanvix
loader's STB_WEAK support (esaurez/nanvix#22) and is gated on the new
libposix `pathconf` / `fpathconf` stubs (esaurez/nanvix#23) for the
configure conftest to even produce an executable.

Three coordinated link-flag changes to the `CONFIGURE_ENV` block:

  1. `LIBS` segment 1 -- new `--whole-archive ... --no-whole-archive`
     block ahead of the existing `--start-group`. Forces every object
     from libposix, libc, libm, libstdc++, and libgcc into python.elf
     so the runtime symbols extension `.so`s depend on are embedded
     (and re-exported via `-Wl,--export-dynamic`, already present).
     Without this, the static linker drops unreferenced objects
     (e.g. `fscanf`, `longjmp`, `strtold_l` for numpy; `operator
     new/delete[]`, `__cxa_*`, `_Unwind_*`, `std::type_info` vtables
     for any C++ extension) and subsequent dlopen() of those `.so`s
     fails with "symbol not found".

  2. `LIBS` segment 2 -- the existing `--start-group` is trimmed to
     just the external add-on libraries (sqlite3, ssl, crypto, z, bz2,
     lzma, ffi). It no longer re-lists libposix / libc / libm: those
     archives are already fully included by segment 1, so the external
     libs can resolve their references against the already-embedded
     objects.

  3. Two new top-level Makefile vars `LIBSTDCXX := -lstdc++` and
     `LIBGCC := -lgcc`. The GCC driver resolves them against its built-
     in search paths (libgcc lives under a versioned `lib/gcc/i686-
     nanvix/<gcc-version>/` directory, which would be fragile to
     hardcode). Defined once at top level because the `-l` form is
     identical between the docker and host build paths.

`LDFLAGS` is unchanged. The existing `-Wl,--allow-multiple-definition`
flag is kept and the surrounding comment is expanded to honestly
enumerate the duplicate-symbol categories the flag is masking (newlib
long-double math helpers, libposix/libc env+isatty overlaps, libc/libm
math helper overlaps, libgcc internal `__x86.get_pc_thunk.*`
duplicates, etc.) -- the set is large and toolchain-build-version-
dependent, and is the only practical workaround until the contributing
upstreams are fixed.

`.nanvix/config.py::configure_env()` -- an unused helper that mirrors
`Makefile.nanvix`'s `CONFIGURE_ENV` -- is kept in sync (same
`--whole-archive` LIBS, same LDFLAGS) and gains a docstring calling
out the dead-code status. A separate small cleanup PR can delete the
helper entirely.

Validated end-to-end on the Nanvix microvm: CPython 3.12 + numpy 1.26.4
runs `import numpy`, `np.arange`, `np.dot`, `reshape`, `flatten`, and
broadcasting, all producing `NUMPY_TEST_OK`. `hello.py` and the existing
single-process / multi-process / standalone modes are unaffected by
the change because the linker flags are not mode-conditional.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ppenna pushed a commit to nanvix/cpython that referenced this pull request Jun 10, 2026