SortedDictionary<BigInteger, MyObject> System.AccessViolationException #108763
Comments
Tagging subscribers to this area: @dotnet/area-system-collections |
Please attach a self-contained repro or a crash dump, it's not possible to diagnose the issue from just the given snippet. It could be caused by a semi-valid unsafe/interop in your application (stack traces could be confusing in such cases) or a thread-safety issue (SortedDictionary is not thread safe). Also, is this Linux or Windows? BDN has a known issue on Linux + DisassemblyDiagnoser. |
The problem is that I tried using dotMemory to analyse the error, but as soon as the memory profiler is attached the error is gone. I run on Windows 11. |
You can enable crashdumps by following https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps (use FullDump - it has the complete info) |
Thanks, here is the dump: https://we.tl/t-zWlR3MhCbz. Sorry, I had to use WeTransfer because GitHub is blocking the upload. I strongly believe the error is within SortedDictionary<>.Keys in combination with ToList() or ToArray().
|
Details of the crash - bad GC pointer:
The crash is on .NET 9 RC1. We fixed several bugs that can lead to intermittent crashes like this one for .NET 9 RC2. Could you please upgrade to .NET 9 RC2 and see whether it is going to fix the crash?
Next time, you can create an issue at https://developercommunity.visualstudio.com/, attach the dump to it and link it from here. Developer Community does not have upload limits like GitHub and it has better privacy controls. |
The crash still remains with .NET 9 RC2. |
Could you please upload a .NET 9 RC2 crash dump? |
I uploaded the crashdump here: |
Thanks - the RC2 crash is exactly the same as the RC1 crash. There is a weird discrepancy between the state of the registers and the faulting address in
vs.
This suggests that hardware somehow accessed a different address than what the registers are pointing at. Are you able to reproduce this crash on more than one machine? One possible explanation of these symptoms is a hardware defect. |
Also, could you please try whether it reproduces with disabled tiered compilation (e.g. The faulting method has 4
|
I tried that with no effect. The error is still the same. |
That's highly unlikely. The system, despite using Windows 11 as its OS, runs rock solid for weeks without a reboot. A hardware defect in this area should have a negative impact on system stability in general, but I do not see one. The error did not appear on a second machine I tried, but that machine has a completely different CPU generation. Maybe it is something CPU or chipset specific. I backported the solution to .NET 8, and as soon as the project runs on .NET 8 the error is gone, even when BenchmarkDotNet itself runs in a .NET 9 environment.
I can provide the whole solution if that is of any help. |
Ok, it is a good data point that it does not repro on the other machine. Here is another thing to try: Could you please try to add
Yes, that would help too. |
That did it! Is this enough information, or do you still need the whole solution? |
Tagging subscribers to this area: @mangod9 |
The sequence that leads to the crash (#108763 (comment) is the full dump to look at):
I suspect that the bad crash inside WriteBarrier is somehow caused by CET-aware thread suspension interleaving with the above sequence. The crash does not repro when CET is disabled. @janvorli @VSadov Does it ring any bells? Could you please take it from here? |
What I don't understand though is that even if we hit whatever issue in the write barrier, then if we adjusted the context back to the SortedSet, we should have invoked the exception handling code, but the dump seems to indicate that it didn't happen.
@swtrse that would be great, it would enable me to debug why the managed exception handling didn't kick in. Running !analyze on the dump shows the crash was due to an attempt to write to address 0000017a27a803f. Since that's out of the null reference detection area, that's likely causing the AV. |
We have invoked some exception handling code.
I think you meant
|
Ah, right. We have called |
Right, incomplete cut&paste |
I uploaded my whole source code to |
Hijacking the caller should be ok, I think, even if it then tailcalls the barrier. |
The problem is that the vectored exception handler returns to the location where the "ret" was located in the barrier, and then the crash occurs because there is no longer a ret there. I guess that's how Windows implements the special hijacking - it just returns from the vectored exception handler. Since the return address should be fixed by now, it would just re-run the ret instruction and return to the caller. But if the ret instruction is gone, it ends up continuing execution of whatever is there instead. |
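The failure mode described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: `BarrierSlot`, `resumes_on_ret`, `swap_barrier` and the byte sequences are hypothetical and invented for illustration, not the runtime's actual barrier code. The only real fact relied on is that `0xC3` is the x86-64 near `ret` opcode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (not runtime code): a patchable slot holding one copy of a
// write barrier routine. All names here are hypothetical.
struct BarrierSlot {
    std::vector<std::uint8_t> code;
};

constexpr std::uint8_t kRet = 0xC3;  // x86-64 near return opcode
constexpr std::uint8_t kNop = 0x90;  // stands in for "some other instruction"

// Returns true when resuming at `offset` would re-execute a ret.
bool resumes_on_ret(const BarrierSlot& slot, std::size_t offset) {
    return offset < slot.code.size() && slot.code[offset] == kRet;
}

// Simulates replacing the barrier body with another variant.
void swap_barrier(BarrierSlot& slot, const std::vector<std::uint8_t>& variant) {
    assert(!variant.empty());
    slot.code = variant;
}

// Walks through the interleaving discussed in the thread:
// 1. a thread is interrupted on the ret at `saved_offset`,
// 2. the barrier body is replaced while the handler runs,
// 3. the handler returns to the saved offset.
// Returns true only if the resumed thread still lands on a ret.
bool resume_after_swap_hits_ret() {
    BarrierSlot slot{{kNop, kNop, kRet}};                // ret at offset 2
    const std::size_t saved_offset = 2;
    const bool before = resumes_on_ret(slot, saved_offset);
    swap_barrier(slot, {kNop, kNop, kNop, kNop, kRet});  // ret moved to offset 4
    const bool after = resumes_on_ret(slot, saved_offset);
    return before && after;
}
```

In this model the saved offset points at a `ret` before the swap and at a different byte afterwards, which is the situation the comment describes: execution simply continues with whatever now occupies that location.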
Would we have to suspend runtime in order to swap the barrier code though? |
Is it possible? I thought we don't do tail calls for helper calls. cc @jakobbotsch |
One way CET can mess up hijacking is that Windows will try to unhijack as well, using the top of the shadow stack. This caused problems before, when our stashed hijack address (which captures the return address from the actual stack) and the shadow stack were not in sync. |
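To make "in sync" concrete: with CET, return addresses are also kept on a protected shadow stack, and a `ret` only succeeds when both copies agree. The following is a minimal software sketch of that idea. `ToyThread`, `toy_call`, `toy_ret` and `toy_hijack` are hypothetical names; real hardware performs the comparison in the CPU, and nothing here reflects how Windows or the runtime actually implement hijacking.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a thread with a normal stack and a CET-style shadow stack.
// Only return addresses are tracked. All names are hypothetical.
struct ToyThread {
    std::vector<std::uint64_t> stack;         // return addresses on the normal stack
    std::vector<std::uint64_t> shadow_stack;  // protected copies
};

// A call pushes the return address onto both stacks.
void toy_call(ToyThread& t, std::uint64_t return_address) {
    t.stack.push_back(return_address);
    t.shadow_stack.push_back(return_address);
}

// Returns true if a ret would succeed, i.e. both copies agree.
// On a mismatch nothing is popped, mirroring a fault raised on the ret.
bool toy_ret(ToyThread& t) {
    if (t.stack.empty() || t.shadow_stack.empty()) return false;
    if (t.stack.back() != t.shadow_stack.back()) return false;
    t.stack.pop_back();
    t.shadow_stack.pop_back();
    return true;
}

// Return-address hijacking overwrites only the normal stack slot and
// hands back the original address so it can be restored later.
std::uint64_t toy_hijack(ToyThread& t, std::uint64_t hijack_target) {
    assert(!t.stack.empty());
    const std::uint64_t original = t.stack.back();
    t.stack.back() = hijack_target;
    return original;
}
```

In this model a hijacked frame cannot return until the stashed address is written back, and any party that restores the slot from a stale copy leaves the two stacks disagreeing.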
Now it is actually clear, as the return address hijacking using the special address is only used when CET is enabled. |
That is not a JIT-inserted call. StelemRef calls the barrier directly - as an FCALL
[MethodImpl(MethodImplOptions.InternalCall)]
private static extern void WriteBarrier(ref object? dst, object? obj);
Can these be tail-called? |
Figured that. |
An obvious way to fix this is to forbid tailcalling of
Another thought is: When returning from the hijack handler, we perform the "undo the pop to hijacked call site" - that is specifically to hit the
if (areShadowStacksEnabled)
{
    // Undo the "pop", so that the ret could now succeed.
    interruptedContext->Rsp = interruptedContext->Rsp - 8;
    interruptedContext->Rip = origIp;
}
I do not recall all the reasons why we do this. |
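The adjustment quoted in this comment is plain arithmetic on the saved register context. A minimal sketch, assuming a hypothetical `ToyContext` that holds only the two registers involved (the real code works on the OS-provided context record). That the earlier "pop" also sets the instruction pointer to the caller is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the interrupted thread's register context.
struct ToyContext {
    std::uint64_t Rsp;  // stack pointer
    std::uint64_t Rip;  // instruction pointer
};

constexpr std::uint64_t kSlotSize = 8;  // one return-address slot on x64

// "Pop": make the context look as if the ret had already consumed the
// return address, so the thread appears to be in the caller.
// Setting Rip here is an assumption of this sketch.
void pop_to_caller(ToyContext& ctx, std::uint64_t caller_ip) {
    ctx.Rsp += kSlotSize;
    ctx.Rip = caller_ip;
}

// "Undo the pop": point the context back at the original ret so that it
// can execute for real when the thread resumes.
void undo_pop(ToyContext& ctx, std::uint64_t orig_ip) {
    assert(ctx.Rsp >= kSlotSize);
    ctx.Rsp -= kSlotSize;
    ctx.Rip = orig_ip;
}
```

Applying `pop_to_caller` and then `undo_pop` with the original instruction pointer restores the context exactly, which is why the resumed thread lands on the `ret` again.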
I think we need to hit the ret in order to pop from the shadow stack too. |
Can a write barrier be a source of an exception? e.g. nullref
Ah, I guess so then. It's not a helper call indeed in this case. |
Yes, it can. But we detect that in the EH code and pretend that the exception occurred in the caller of the barrier. |
I've verified that the fix I have suggested above (return from VEH when we find the IP is not in managed code - we can actually explicitly check if it is in any helper or just write barrier) works with the repro app. |
That is what I mean - instead of re-popping the context and relying on That would not require probing the return address for managedness or being in range of barriers. |
As in - unhijack current thread and return VEH_CONTINUE_EXECUTION ? I think that would work. Perhaps could check just for the write barrier as that is the only thing that can get its code replaced. Other cases where we tailcall FCALLs should be ok. (and the "incsspq" solution might not work if OS pushes some stuff on SSP before calling our handler) |
We basically don't need to do anything, just check the IP to be in a write barrier and return the VEH_CONTINUE_EXECUTION if it was, right after the runtime/src/coreclr/vm/excep.cpp Lines 6525 to 6533 in d6ca550
So it looks like this:
if (areShadowStacksEnabled)
{
    // The OS should have fixed the SP value to the same as we've stashed for the hijacked thread
    _ASSERTE(*(size_t *)interruptedContext->Rsp == (uintptr_t)pThread->GetHijackedReturnAddress());
    if (IsIPInWriteBarrierCodeCopy(interruptedContext->Rip))
    {
        return VEH_CONTINUE_EXECUTION;
    }
    // When CET is enabled, the interruption happens on the ret instruction in the callee.
    // We need to "pop" rsp to the caller, as if the ret has consumed it.
    interruptedContext->Rsp += 8;
}
|
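The decision proposed in this comment boils down to an address-range test followed by an early return. A minimal sketch of that shape, assuming hypothetical names throughout (`ToyRange`, `ToyRegs`, `ToyDisposition`, `ip_in_range`, `handle_interruption`). It does not reproduce how the runtime's `IsIPInWriteBarrierCodeCopy` is implemented, and later comments in the thread settle on a different approach.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type standing in for the handler's return value.
enum class ToyDisposition { ContinueExecution, AdjustAndProceed };

// Hypothetical description of where the barrier code copy lives.
struct ToyRange {
    std::uint64_t start;
    std::uint64_t size;
};

// Hypothetical register context with only the fields used here.
struct ToyRegs {
    std::uint64_t Rsp;
    std::uint64_t Rip;
};

// Half-open range check: start <= ip < start + size, written so that the
// subtraction cannot underflow.
bool ip_in_range(const ToyRange& r, std::uint64_t ip) {
    return ip >= r.start && ip - r.start < r.size;
}

// If the interrupted IP is inside the barrier copy, leave the context
// untouched and let the thread continue. Otherwise "pop" the return
// address slot as if the ret had consumed it.
ToyDisposition handle_interruption(ToyRegs& regs, const ToyRange& barrier_copy) {
    assert(barrier_copy.size > 0);
    if (ip_in_range(barrier_copy, regs.Rip)) {
        return ToyDisposition::ContinueExecution;
    }
    regs.Rsp += 8;
    return ToyDisposition::AdjustAndProceed;
}
```

The early return keeps the context unmodified for threads caught inside the barrier, so the only state change happens on the path where the interrupted code is known not to be replaceable.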
We need to clear the hijacked state before allowing the thread to “escape” above it. |
Actually, I've realized that better than ignoring hijacking in the write barrier would be to let it happen, but just change the |
Hmm, that doesn't work with CET as the return from VEH checks the Rip and it doesn't match. |
@VSadov It looks like your idea - keeping the context Rsp / Rip at the managed caller where we set it before syncing with GC and updating Ssp in the context is the best we can do here. I gave it a quick try and it seems to work fine. |
cc @mangod9 . |
I think this was fixed in #109074. It may need to be backported to 9.0 though. |
Yes, that's why I have left it open. It didn't make it to the 9.0 GA, so we will ship it in the first servicing release. |
Hi, did this get fixed in 9.0.2? |
Description
Hi,
I ran into a problem where using SortedDictionary<BigInteger, MyObject> threw a System.AccessViolationException every single time.
The error occurred during a BenchmarkDotNet run.
The error does not occur in debug mode. As far as I can tell, it only appears in Release mode.
Also, the error disappears when dotMemory is attached to analyze the behavior.
Reproduction Steps
MyObject is a sealed generic class, but I think it does not matter.
Expected behavior
I expect no AccessViolationException when calling Framework functions
Actual behavior
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Collections.Generic.List`1[[System.Numerics.BigInteger, System.Runtime.Numerics, Version=9.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a]]..ctor(System.Collections.Generic.IEnumerable`1<System.Numerics.BigInteger>)
Regression?
No response
Known Workarounds
No response
Configuration
// BenchmarkDotNet v0.14.0
// Runtime=.NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256
Other information
No response