
SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763

Open
swtrse opened this issue Oct 10, 2024 · 57 comments
Labels
area-VM-coreclr regression-from-last-release tenet-reliability Reliability/stability related issue (stress, load problems, etc.)

Comments

@swtrse

swtrse commented Oct 10, 2024

Description

Hi,

I ran into a problem where using SortedDictionary<BigInteger, MyObject> ran into a System.AccessViolationException every single time.
The error occurred during a BenchmarkDotNet run.
The error does not occur in debug mode; as far as I can tell, it only appears in Release mode.
Also, the error disappears if you try to analyze the behavior with dotMemory.

Reproduction Steps

MyObject is a sealed generic class, but I think that does not matter.

public sealed class Test
{
    private readonly SortedDictionary<BigInteger, MyObject> _limitSortedList;

    // ...

    private MyObject? GetNextNode(BigInteger target)
    {
        if (_limitSortedList.TryGetValue(target, out MyObject? node)) { return node; }

        var low = 0;
        int high = _limitSortedList.Count - 1;
        List<BigInteger> list = _limitSortedList.Keys.ToList();   // <---- Exception

        // Binary search for the index of the first key greater than the target.
        while (low <= high)
        {
            int mid = (low + high) / 2;
            int comparison = _limitSortedList.Comparer.Compare(list[mid], target);

            if (comparison < 0) { low = mid + 1; }
            else { high = mid - 1; }
        }

        return low < _limitSortedList.Count ? _limitSortedList[list[low]] : null;
    }
}

Expected behavior

I expect no AccessViolationException when calling framework functions.

Actual behavior

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Collections.Generic.List`1[[System.Numerics.BigInteger, System.Runtime.Numerics, Version=9.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]..ctor(System.Collections.Generic.IEnumerable`1<System.Numerics.BigInteger>)

Regression?

No response

Known Workarounds

No response

Configuration

// BenchmarkDotNet v0.14.0
// Runtime=.NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256

Other information

No response

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 10, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@EgorBo
Member

EgorBo commented Oct 10, 2024

Please attach a self-contained repro or a crash dump; it's not possible to diagnose the issue from just the given snippet. It could be caused by semi-valid unsafe/interop code in your application (stack traces can be confusing in such cases) or by a thread-safety issue (SortedDictionary is not thread-safe). Also, is this Linux or Windows? BDN has a known issue on Linux + DisassemblyDiagnoser.

@swtrse
Author

swtrse commented Oct 11, 2024

The problem is, I tried using dotMemory to analyze the error, and as soon as the memory profiler is attached the error is gone.
I see no way to generate a crash dump.

I run on Windows 11.
I do not use any unsafe code.

@jkotas
Member

jkotas commented Oct 11, 2024

I see no way to generate a crash dump.

You can enable crash dumps by following https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps (use FullDump - it has the complete info).
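
For reference, the LocalDumps settings from that page boil down to a couple of registry values; a rough sketch from an elevated command prompt (the dump folder path here is just an example):

reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpFolder /t REG_EXPAND_SZ /d "C:\dumps" /f
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpType /t REG_DWORD /d 2 /f

DumpType 2 requests a full dump.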

@swtrse
Author

swtrse commented Oct 11, 2024

Thanks, here is the dump: https://we.tl/t-zWlR3MhCbz. Sorry, I had to use WeTransfer; GitHub is blocking the upload.

And I strongly believe the error is within SortedDictionary<>.Keys in combination with ToList() or ToArray().
Justification:

  1. If I change the code to a version without calling ToList() or ToArray(), the error is gone (one such variant is sketched below).
  2. If I change SortedDictionary<> to SortedList<>, the error is gone.
  3. The error happens instantly after starting the application, I suspect on one of the first calls.
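
For illustration, a ToList()-free variant might look roughly like this (a sketch only, not the exact code used; it relies on SortedDictionary enumerating its keys in sorted order):

private MyObject? GetNextNode(BigInteger target)
{
    if (_limitSortedList.TryGetValue(target, out MyObject? node)) { return node; }

    // Keys enumerate in sort order, so the first key greater than the target
    // identifies the next node.
    foreach (BigInteger key in _limitSortedList.Keys)
    {
        if (_limitSortedList.Comparer.Compare(key, target) > 0) { return _limitSortedList[key]; }
    }

    return null;
}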

@jkotas
Member

jkotas commented Oct 11, 2024

Details of the crash - bad GC pointer:

System_Collections!System.Collections.Generic.SortedSet<System.Collections.Generic.KeyValuePair<System.Numerics.BigInteger,System.__Canon>>.InOrderTreeWalk+0x135:
00007ffe`b79fdbb5 ff4514          inc     dword ptr [rbp+14h] ss:0000013a`1638abfc=00002213

The crash is on .NET 9 RC1. We fixed several bugs for .NET 9 RC2 that can lead to intermittent crashes like this one. Could you please upgrade to .NET 9 RC2 and see whether that fixes the crash?

Thanks, here is the dump: https://we.tl/t-zWlR3MhCbz.

Next time, you can create an issue at https://developercommunity.visualstudio.com/, attach the dump to it, and link it from here. developercommunity does not have upload limits like GitHub does, and it has better privacy controls.

@swtrse
Author

swtrse commented Oct 15, 2024

The crash still remains with .NET 9 RC2.

@jkotas
Member

jkotas commented Oct 15, 2024

Could you please upload a .NET 9 RC2 crash dump?

@swtrse
Author

swtrse commented Oct 16, 2024

@jkotas
Member

jkotas commented Oct 16, 2024

Thanks - the RC2 crash is exactly the same as the RC1 crash.

There is a weird discrepancy between the state of the registers and the faulting address in the EXCEPTION_RECORD in both dumps.

00007ffe`b79fdbb5 ff4514          inc     dword ptr [rbp+14h] ss:0000013a`1638abfc=00002213

vs.

0:024> dx -r1 ((coreclr!_EXCEPTION_RECORD *)0x38c4efc4f0)
((coreclr!_EXCEPTION_RECORD *)0x38c4efc4f0)                 : 0x38c4efc4f0 [Type: _EXCEPTION_RECORD *]
    [+0x000] ExceptionCode    : 0xc0000005 [Type: unsigned long]
    [+0x004] ExceptionFlags   : 0x1 [Type: unsigned long]
    [+0x008] ExceptionRecord  : 0x0 [Type: _EXCEPTION_RECORD *]
    [+0x010] ExceptionAddress : 0x7ffc351acd62 [Type: void *]
    [+0x018] NumberParameters : 0x2 [Type: unsigned long]
    [+0x020] ExceptionInformation [Type: unsigned __int64 [15]]
0:024> dx -r1 (*((coreclr!unsigned __int64 (*)[15])0x38c4efc510))
(*((coreclr!unsigned __int64 (*)[15])0x38c4efc510))                 [Type: unsigned __int64 [15]]
    [0]              : 0x1 [Type: unsigned __int64]
    [1]              : 0x24eba52b1a6 [Type: unsigned __int64] <- This should be same as [rbp+14h] ss:0000013a`1638abfc

This suggests that hardware somehow accessed a different address than what the registers are pointing at.

Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect.

@jkotas
Member

jkotas commented Oct 16, 2024

Also, could you please try whether it reproduces with disabled tiered compilation (e.g. DOTNET_TieredCompilation=0 environment variable)?
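
For example, from a PowerShell prompt before launching the benchmark run (the launch command itself is just a placeholder):

$env:DOTNET_TieredCompilation = '0'
dotnet run -c Release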

The faulting method has 4 OptimizedTier1OSR variants of the code, which is unusual. It may be related.

  IL Addr:            00007fff8c6cc624
     CodeAddr:           00007ffeb79e5ca0  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5F00
     CodeAddr:           00007ffeb79e5fa0  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5C40
     CodeAddr:           00007ffeb79fcb40  (QuickJitted + Instrumented)
     NativeCodeVersion:  0000013A0078C9D0
     CodeAddr:           00007ffeb79fd220  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB6180
     CodeAddr:           00007ffeb79fda80  (OptimizedTier1OSR)
     NativeCodeVersion:  0000013A11EB5AC0
     CodeAddr:           0000000000000000  (OptimizedTier1)
     NativeCodeVersion:  0000013A00794FC0
     CodeAddr:           00007ffeb79e3e50  (QuickJitted)
     NativeCodeVersion:  0000000000000000

@swtrse
Author

swtrse commented Oct 16, 2024

Also, could you please try whether it reproduces with disabled tiered compilation (e.g. DOTNET_TieredCompilation=0 environment variable)?

I tried that with no effect; the error is still the same.

@swtrse
Author

swtrse commented Oct 17, 2024

This suggests that hardware somehow accessed a different address than what the registers are pointing at.

Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect.

That's highly unlikely. The system, despite using Windows 11 as an OS, runs rock solid for weeks without a reboot. A hardware defect in this area should have a negative impact on system stability in general, but I do not see that.

The error did not appear on a second machine I tried. But the other machine is a completely different CPU generation, so maybe it is something CPU- or chipset-specific.

I backported the solution to .NET 8; as soon as the project runs with .NET 8, the error is gone, even when BenchmarkDotNet is running with the .NET 9 environment:

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.14.0
// Runtime=.NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
// Job: Job-NHHCHO(Runtime=.NET 9.0, Toolchain=InProcessEmitToolchain, InvocationCount=1, IterationCount=3, LaunchCount=1, UnrollFactor=1, WarmupCount=3)

I can provide the whole solution if that is of any help.

@jkotas
Member

jkotas commented Oct 17, 2024

Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add <PropertyGroup> <CETCompat>false</CETCompat> </PropertyGroup> to your .csproj file and check whether it makes the problem go away? (CET is a hardware security feature that we have enabled in .NET 9 by default when it is available - https://learn.microsoft.com/en-us/dotnet/core/compatibility/interop/9.0/cet-support has more details.)
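
Formatted, the suggested project file change is:

<PropertyGroup>
  <CETCompat>false</CETCompat>
</PropertyGroup>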

I can provide the whole solution if that is of any help.

Yes, that would help too.

@swtrse
Author

swtrse commented Oct 17, 2024

Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add <PropertyGroup> <CETCompat>false</CETCompat> </PropertyGroup> to your .csproj file and check whether it makes the problem go away? (CET is a hardware security feature that we have enabled in .NET 9 by default when it is available - https://learn.microsoft.com/en-us/dotnet/core/compatibility/interop/9.0/cet-support has more details.)

That did it!
With this setting the error is gone.
I only set it in the console app that runs the benchmarks, and the error is no more. I was prepared to add that switch to all libraries in the solution, but it was not necessary.

Is this enough information, or do you still need the whole solution?

@jkotas jkotas added area-VM-coreclr regression-from-last-release tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed area-System.Collections labels Oct 17, 2024
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Oct 17, 2024

The sequence that leads to the crash (#108763 (comment) is the full dump to look at):

  • SortedSet code called CastHelpers.StelemRef
  • CastHelpers.StelemRef tail-called WriteBarrier
  • We hit some sort of bad crash inside WriteBarrier
  • We unwound from the write barrier and patched the exception context, which makes it look like the crash is in SortedSet code

I suspect that the bad crash inside WriteBarrier is somehow caused by CET-aware thread suspension interleaving with the above sequence. The crash does not repro when CET is disabled.

@janvorli @VSadov Does it ring any bells? Could you please take it from here?

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 17, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Oct 17, 2024
@janvorli
Member

What I don't understand, though, is that even if we hit some issue in the write barrier and then adjusted the context back to the SortedSet, we should have invoked the exception handling code; the dump seems to indicate that didn't happen.
I think the fact that the exception location is not the one in the write barrier but instead at its caller is just a side effect of us unwinding the write barrier in the exception context we get from the OS, and Watson seeing that modified context.

I can provide the whole solution if that is of any help.

@swtrse that would be great; it would enable me to debug why the managed exception handling didn't kick in. Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f. Since that's outside the null reference detection area, that's likely what causes the AV.

@jkotas
Member

jkotas commented Oct 17, 2024

we should have invoked the exception handling code

We have invoked some exception handling code. Thread.m_ExceptionState.m_pCurrentTracker.m_ptrs has the information about the crash.

Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f.

I think you meant 0x0000017a27a803f1. It is the address of the RegionToGeneration map in the write barrier. I do not see how we ended up writing to that address.

CORINFO_HELP_ASSIGN_REF:
00007ffe`b6da0010 488911          mov     qword ptr [rcx],rdx
00007ffe`b6da0013 4c8bc1          mov     r8,rcx
00007ffe`b6da0016 48b8f103a8277a010000 mov rax,17A27A803F1h <----
00007ffe`b6da0020 48c1e916        shr     rcx,16h
00007ffe`b6da0024 803c0100        cmp     byte ptr [rcx+rax],0
00007ffe`b6da0028 7504            jne     00007ffe`b6da002e
...

@janvorli
Member

Ah, right. We have called EEPolicy::HandleFatalError in FailFastIfCorruptingStateException, and so Watson took the context passed to HandleFatalError as the one to show as the top frame.

@janvorli
Member

I think you meant 0x0000017a27a803f1.

Right, incomplete cut&paste

@swtrse
Author

swtrse commented Oct 17, 2024

@VSadov
Member

VSadov commented Oct 18, 2024

Hijacking the caller should be ok, I think, even if it then tailcalls the barrier.

@janvorli
Member

janvorli commented Oct 18, 2024

The problem is that the vectored exception handler returns to the location where the "ret" was located in the barrier, and then the crash occurs because there is no longer a ret there. I guess that's how Windows implements the special hijacking - it just returns from the vectored exception handler. Since the return address should be fixed by now, it would just re-run the ret instruction and return to the caller. But if the return instruction is gone, it just ends up continuing execution of whatever is there.

@VSadov
Member

VSadov commented Oct 18, 2024

Would we have to suspend runtime in order to swap the barrier code though?

@EgorBo
Member

EgorBo commented Oct 18, 2024

The key is that the write barrier is tail called from a managed method

Is it possible? I thought we don't do tail calls for helper calls... cc @jakobbotsch

@VSadov
Member

VSadov commented Oct 18, 2024

And I don't know how CET would influence that since we should be using the APC both with and without CET enabled.

One way CET can mess up hijacking is that Windows will try to unhijack as well, using the top of the shadow stack. This did cause problems before, when our stashed hijack address (which captures the return address from the actual stack) and the shadow stack were not in sync.

@janvorli
Member

Now it is actually clear, as the return address hijacking using the special address is only used when CET is enabled.

@VSadov
Member

VSadov commented Oct 18, 2024

The key is that the write barrier is tail called from a managed method

is it possible? I thought we don't do tail calls for helper calls.. cc @jakobbotsch

That is not a JIT-inserted call. StelemRef calls the barrier directly - as an FCALL

        [MethodImpl(MethodImplOptions.InternalCall)]
        private static extern void WriteBarrier(ref object? dst, object? obj);

Can these be tail-called?
(I think that still would be ok, just wonder if this is the reason why StelemRef is special here)

@VSadov
Member

VSadov commented Oct 18, 2024

Would we have to suspend runtime in order to swap the barrier code though?

Figured that.
The runtime will indeed be suspended before we resume from the handler. There will be an opportunity for the barrier code to change.

@VSadov
Member

VSadov commented Oct 18, 2024

An obvious way to fix this is to forbid tailcalling of WriteBarrier. That might regress perf of array element writes.

Another thought is:

When returning from the hijack handler, we perform the "undo the pop to hijacked call site" step - that is specifically done to hit the ret that is problematic in this case.

        if (areShadowStacksEnabled)
        {
            // Undo the "pop", so that the ret could now succeed.
            interruptedContext->Rsp = interruptedContext->Rsp - 8;
            interruptedContext->Rip = origIp;
        }

I do not recall all the reasons why we do this.
If it is just to match the shadow stack and let both stacks pop naturally, perhaps we could pop the shadow stack instead? This would ensure that we resume to a safe point.

@janvorli
Member

I think we need to hit the ret in order to pop from the shadow stack too.
As for fixing the issue, wouldn't just returning from the vectored exception handler, if we find the address is not in managed code, be sufficient? That seems like the simplest thing that should work.

@EgorBo
Member

EgorBo commented Oct 18, 2024

Can a write barrier be a source of an exception? e.g. nullref

Can these be tail-called?

Ah, I guess so then. It's not a helper call indeed in this case.

@janvorli
Member

Can a write barrier be a source of an exception? e.g. nullref

Yes, it can. But we detect that in the EH code and pretend that the exception occurred in the caller of the barrier.

@janvorli
Member

I've verified that the fix I suggested above (return from the VEH when we find the IP is not in managed code - we can actually explicitly check whether it is in any helper, or just the write barrier) works with the repro app.

@VSadov
Member

VSadov commented Oct 19, 2024

I think we need to hit the ret in order to pop from the shadow stack too.

That is what I mean - instead of re-popping the context and relying on ret to pop both stacks, we could just pop the shadow stack (i.e. incsspq)

That would not require probing the return address for managedness or being in range of barriers.

@VSadov
Member

VSadov commented Oct 19, 2024

As for fixing the issue, wouldn't just returning from the vectored exception handler, if we find the address is not in managed code, be sufficient? That seems like the simplest thing that should work.

As in - unhijack the current thread and return VEH_CONTINUE_EXECUTION?

I think that would work. Perhaps could check just for the write barrier as that is the only thing that can get its code replaced. Other cases where we tailcall FCALLs should be ok.

(and the "incsspq" solution might not work if OS pushes some stuff on SSP before calling our handler)

@janvorli
Member

We basically don't need to do anything; just check whether the IP is in a write barrier and return VEH_CONTINUE_EXECUTION if it is, right after the _ASSERTE in the code below:

        if (areShadowStacksEnabled)
        {
            // OS should have fixed the SP value to the same as we`ve stashed for the hijacked thread
            _ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress());
            // When the CET is enabled, the interruption happens on the ret instruction in the calee.
            // We need to "pop" rsp to the caller, as if the ret has consumed it.
            interruptedContext->Rsp += 8;
        }

So it looks like this:

        if (areShadowStacksEnabled)
        {
            // OS should have fixed the SP value to the same as we`ve stashed for the hijacked thread
            _ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress());

            if (IsIPInWriteBarrierCodeCopy(interruptedContext->Rip))
            {
                return VEH_CONTINUE_EXECUTION;
            }

            // When the CET is enabled, the interruption happens on the ret instruction in the calee.
            // We need to "pop" rsp to the caller, as if the ret has consumed it.
            interruptedContext->Rsp += 8;
        }

@VSadov
Member

VSadov commented Oct 19, 2024

We need to clear the hijacked state before allowing the thread to “escape” above it.
It is permitted for a thread to clear its hijack, as long as it is in coop mode, but it is not permitted to simply ignore it. On AOT, a thread seen in an “escaped” state may cause asserts; not sure about core.
The reason is that removing the hijack later could become dangerous if the thread ends up below the hijack again, but with a different stack.

@janvorli
Member

Actually, I've realized that rather than ignoring hijacking in the write barrier, it would be better to let it happen but just change the origIp to a known location of a ret instruction. We can just add an asm helper that contains only a ret and set it there. Then we can still run the GC on the hijack, but the return from the vectored exception handler would always hit a ret and the right thing would happen.

@janvorli
Member

Hmm, that doesn't work with CET as the return from VEH checks the Rip and it doesn't match.

@janvorli
Member

@VSadov It looks like your idea - keeping the context Rsp / Rip at the managed caller where we set it before syncing with the GC, and updating Ssp in the context - is the best we can do here. I gave it a quick try and it seems to work fine.

@JulieLeeMSFT
Member

cc @mangod9 .

@VSadov
Member

VSadov commented Oct 23, 2024

I think this was fixed in #109074

May need to backport to 9.0 though.

@janvorli
Member

Yes, that's why I have left it open. It didn't make it to the 9.0 GA, so we will ship it in the first servicing release.

@adamjones2
Contributor

Hi, did this get fixed in 9.0.2?
