
SOS '!ClrStack' fails after SoftwareExceptionFrame on net11 macOS triage dumps (cDAC missing R2R odd-entrypoint fast path) #5910

Description

@max-charlamb

Summary

After we enabled cDAC default-on for net11 (#5874), SOSMethodTests.Reflection started failing intermittently on Linux/macOS in CI for the net11 triage dump iteration. The same test passes for net8/9/10 and for non-triage (heap) dumps. Across the public and internal diagnostics pipelines I counted ~12 failures in the last 7 days, all with the same shape.

Failing leg: lldb + dotnet-dump-create-triage style minidump (DOTNET_DbgMiniDumpType=3) on net11 preview, x64.

Symptom

SOS !ClrStack on the crashing thread produces only:

OS Thread Id: 0x274bc (0)
        Child SP               IP Call Site
00007FF7B9EDBE40 00007ff81604aada [SoftwareExceptionFrame: 00007ff7b9edbe40]
<failed>
Stack Walk failed. Reported stack incomplete.

The walk yields exactly one frame (the SEF) and then aborts. The test asserts on later frames (Reflection.MethodBaseInvoker.InvokeWithNoArgs, etc.) and fails.

Root cause

The cDAC's R2R MethodDesc lookup in
src/native/managed/cdac/.../ExecutionManagerCore.ReadyToRunJitManager.GetMethodDescForRuntimeFunction
is missing the AMD64/x86 "odd entry point" fast path that the legacy DAC has in
src/coreclr/vm/readytoruninfo.cpp (ReadyToRunInfo::GetMethodDescForEntryPointInNativeImage):

#if defined(TARGET_AMD64) || defined(TARGET_X86)
    // A normal method entry point is always 8 byte aligned, but a funclet can start
    // at an odd address. Since PtrHashMap can't handle odd pointers, check for this
    // case and return NULL.
    if ((entryPoint & 0x1) != 0)
        return NULL;
#endif
    TADDR val = m_entryPointToMethodDescMap.LookupValueByUniqueKey(PCODEToPINSTR(entryPoint));

(The comment is misleading: PtrHashMap can handle odd keys; a lookup on one simply always returns INVALIDENTRY. The bail is purely a perf optimization that skips a known-useless probe.)

What that means for triage dumps

The producer-side DAC's EnumMem for triage dumps drives stack walking through the same JitCodeToMethodInfo path, which calls GetMethodDescForEntryPoint. For odd entry points (funclets), the legacy DAC returns NULL without ever probing the hashmap. The associated bucket pages are therefore never read by the producer and never enumerated into the triage dump.

The cDAC consumer, lacking the fast path, does probe the hashmap for odd entry points. The hash lands in a bucket page that the producer never enumerated, so the page is unmapped in the dump and reading the bucket throws VirtualReadException. The exception propagates up through StackWalk_1.Next and ClrDataStackWalk.MoveNextLegacyVisible, and SOS reports <failed>.
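The producer/consumer mismatch described above can be sketched with a toy model. This is only an illustration under stated assumptions: `TriageDump`, `producer_lookup`, and `consumer_lookup` are hypothetical names, 4 KiB pages are assumed, and the real code paths (DAC EnumMem, cDAC readers) are not reproduced.

```cpp
#include <cstdint>
#include <set>

constexpr std::uint64_t kPageSize = 0x1000;  // assumption: 4 KiB pages

constexpr std::uint64_t page_of(std::uint64_t addr) {
    return addr & ~(kPageSize - 1);
}

// Hypothetical model of a triage dump: only pages the producer read are present.
struct TriageDump {
    std::set<std::uint64_t> pages;  // page-aligned base addresses captured in the dump

    void capture(std::uint64_t addr) { pages.insert(page_of(addr)); }
    bool can_read(std::uint64_t addr) const { return pages.count(page_of(addr)) != 0; }
};

// Producer side (legacy DAC while writing the dump): odd entry points return
// before the probe, so the bucket they would have hashed to is never captured.
inline void producer_lookup(TriageDump& dump, std::uint64_t entryPoint,
                            std::uint64_t bucketAddr) {
    if ((entryPoint & 0x1) != 0) {
        return;  // fast path: no probe, no page enumerated
    }
    dump.capture(bucketAddr);
}

// Consumer side without the fast path: always reads the bucket.
// Returns false where the real reader would throw VirtualReadException.
inline bool consumer_lookup(const TriageDump& dump, std::uint64_t bucketAddr) {
    return dump.can_read(bucketAddr);
}
```

Feeding the model one even entry point (a made-up value whose bucket sits at 0x7fa3d2026dc0) and the odd funclet entry point from the failing dump (bucket at 0x7fa3d2027080) leaves only the first page in `pages`, so `consumer_lookup` succeeds for the first bucket and fails for the second.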

Concrete trace from a captured failing dump

SOS.ReflectionTest.Triage.dmp from public build 1483142 (macOS x64, net11 preview):

[cdac] IsManaged(ip=0x107a8ec4d)         ← recovered post-SEF managed IP
[cdac] HashMap.GetValue map=0x107704588
                       key=0x107a8ec2d   ← funclet entryPoint, low bit SET
                       size=431
                       buckets=0x7fa3d2022640
                       seed=1105869579 incr=309
[cdac]   probe i=0 slot=297 bucketAddr=0x7fa3d2027080
[cdac] ClrDataStackWalk.Next() EXCEPTION:
       VirtualReadException: Failed to read pointer at 0x7fa3d2027080
         at HashMapLookup.GetValue
         at PtrHashMapLookup.GetValue
         at ReadyToRunJitManager.GetMethodDescForRuntimeFunction
         at ReadyToRunJitManager.AdjustRuntimeFunctionToMethodStart
         at ReadyToRunJitManager.GetMethodInfo
         at ExecutionManagerCore.GetCodeBlockHandle
         at StackWalk_1.IsManaged
         at StackWalk_1.UpdateState
         at StackWalk_1.Next
         at ClrDataStackWalk.MoveNextLegacyVisible

Bucket array page map in the dump:

| Page | Status |
| --- | --- |
| 0x...22xxx | ✓ in dump |
| 0x...23xxx | ✓ in dump |
| 0x...24xxx | ✗ missing |
| 0x...25xxx | ✓ in dump |
| 0x...26xxx | ✓ in dump |
| 0x...27xxx | ✗ missing (this is where slot 297 lands) |
| 0x...28xxx | ✗ missing |
| 0x...29xxx | ✗ missing |

The producer's normal hashmap probes for non-funclet IPs touched pages 22, 23, 25, and 26, so those were captured. Page 27 was never touched because the producer's only odd-entryPoint lookup (for the funclet) was short-circuited by the fast path before it reached the hashmap.
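The numbers in the trace can be cross-checked by hand. The sketch below assumes the usual two-hash open-addressing shape (seed from `key >> 2`, increment from `key >> 5`) and a 64-byte bucket stride inferred from the logged addresses; neither is quoted from the runtime sources in this issue, and the function names are hypothetical. Under those assumptions the arithmetic reproduces the logged seed, incr, slot, and bucketAddr, and places the bucket in the missing 0x...27xxx page.

```cpp
#include <cstdint>

// Assumed hash shape (not taken from this issue): 32-bit seed and increment.
constexpr std::uint32_t hash_seed(std::uint64_t key) {
    return static_cast<std::uint32_t>(key >> 2);
}

constexpr std::uint32_t hash_incr(std::uint64_t key, std::uint32_t size) {
    return 1u + ((static_cast<std::uint32_t>(key >> 5) + 1u) % (size - 1u));
}

// Slot visited by the i-th probe (i = 0 is the first probe).
constexpr std::uint32_t probe_slot(std::uint64_t key, std::uint32_t size, std::uint32_t i) {
    return (hash_seed(key) + i * hash_incr(key, size)) % size;
}

constexpr std::uint64_t kBucketStride = 64;            // assumption: inferred from the trace
constexpr std::uint64_t kPageMask = ~std::uint64_t{0xFFF};  // assumption: 4 KiB pages

constexpr std::uint64_t bucket_addr(std::uint64_t buckets, std::uint32_t slot) {
    return buckets + kBucketStride * slot;
}

// Values copied from the [cdac] trace.
constexpr std::uint64_t kKey = 0x107a8ec2dULL;
constexpr std::uint64_t kBuckets = 0x7fa3d2022640ULL;
constexpr std::uint32_t kSize = 431;

static_assert((kKey & 0x1) != 0, "funclet entry point has the low bit set");
static_assert(hash_seed(kKey) == 1105869579u, "seed matches the trace");
static_assert(hash_incr(kKey, kSize) == 309u, "incr matches the trace");
static_assert(probe_slot(kKey, kSize, 0) == 297u, "first probe hits slot 297");
static_assert(bucket_addr(kBuckets, 297) == 0x7fa3d2027080ULL, "bucketAddr matches the trace");
static_assert((bucket_addr(kBuckets, 297) & kPageMask) == 0x7fa3d2027000ULL,
              "bucket lies in the 0x...27xxx page");
```

The checks are `static_assert`s, so the file only compiles if the assumed formulas agree with the logged values.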

Fix

Add the same AMD64/x86 odd-pointer bail to the cDAC's GetMethodDescForRuntimeFunction. Runtime PR: dotnet/runtime# (will link).
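For illustration only, the shape of that bail can be sketched as follows. The actual change belongs in the cDAC's managed code; this sketch is written in C++ to match the legacy snippet quoted above, and `TargetArch`, `EntryPointMap`, and `method_desc_for_entry_point` are hypothetical names, not runtime APIs. The `probes` counter stands in for target-memory reads so the skipped probe is observable.

```cpp
#include <cstdint>
#include <unordered_map>

enum class TargetArch { X86, Amd64, Arm64 };  // hypothetical; only what the sketch needs

constexpr std::uint64_t kNoMethodDesc = 0;  // stands in for NULL / "not found"

// Hypothetical stand-in for the entry-point-to-MethodDesc hash map.
struct EntryPointMap {
    std::unordered_map<std::uint64_t, std::uint64_t> entries;
    mutable int probes = 0;  // counts lookups that would read target memory

    std::uint64_t probe(std::uint64_t key) const {
        ++probes;
        auto it = entries.find(key);
        return it == entries.end() ? kNoMethodDesc : it->second;
    }
};

inline std::uint64_t method_desc_for_entry_point(const EntryPointMap& map,
                                                 std::uint64_t entryPoint,
                                                 TargetArch arch) {
    const bool oddBailApplies = arch == TargetArch::X86 || arch == TargetArch::Amd64;
    if (oddBailApplies && (entryPoint & 0x1) != 0) {
        // Funclet start: never a key in the map, so skip the read entirely.
        // This keeps the consumer from touching pages the producer never captured.
        return kNoMethodDesc;
    }
    return map.probe(entryPoint);
}
```

With the guard in place, a lookup for the odd key 0x107a8ec2d on `TargetArch::Amd64` returns `kNoMethodDesc` and leaves `probes` unchanged, while even keys still reach the map.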

Validated locally with the captured failing dump:

| Build | Result |
| --- | --- |
| Baseline cDAC (no fix) | dies after 1 frame |
| Fixed cDAC | full 8-frame managed stack with file/line numbers |

After the runtime PR merges and rides into the diagnostics package via the standard flow, the failing Reflection tests should go green on net11.
