Runtime crashes on OSX running compiler unit tests #97186
Comments
Tagging subscribers to this area: @dotnet/gc
Issue Details
Description
The runtime is crashing on OSX when running the compiler unit tests. There is no specific test that causes the crash but the crash is very consistent (approaching 100%).
Unfortunately, due to this being an OSX dump, the dump files are too large for us to upload a full dump. The best we can get is a mini dump and json report:
This is blocking our ability to test on OSX because there is no specific test to disable to work around the crash that we can see. The same tests pass just fine on Windows and Linux.
Reproduction Steps
Run the C# compiler tests in CI. Any build in this pipeline with the main branch filter will demonstrate the crash.
Expected behavior
Unit tests pass
Actual behavior
Crashes with the following stack trace:
Regression?
Yes this is a recent regression that came around the time of adopting the .NET 8 SDK GA
Known Workarounds
None
Configuration
Complicated
Other information
At the moment we are collecting OSX dumps with the following settings:
DOTNET_DbgEnableMiniDump=1
DOTNET_DbgMiniDumpName=/cores/crash.%d.%e.dmp
DOTNET_DbgMiniDumpType=1
DOTNET_EnableCrashReport=1
Happy to adjust this as needed to help with the investigation.
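For anyone trying to reproduce with the same dump configuration outside CI, the settings listed above could be applied to a child process roughly as follows. This is only a sketch: the four variable names and values are the ones quoted in this issue, while the helper names (`minidump_env`, `run_with_minidumps`) are hypothetical and not part of any Roslyn or Helix tooling.

```python
import os
import subprocess

# Dump settings exactly as quoted in the issue description.
MINIDUMP_SETTINGS = {
    "DOTNET_DbgEnableMiniDump": "1",
    "DOTNET_DbgMiniDumpName": "/cores/crash.%d.%e.dmp",
    "DOTNET_DbgMiniDumpType": "1",
    "DOTNET_EnableCrashReport": "1",
}


def minidump_env(base=None):
    """Return a copy of the environment with the dump settings applied.

    `base` defaults to the current process environment; it is never mutated.
    """
    env = dict(os.environ if base is None else base)
    env.update(MINIDUMP_SETTINGS)
    return env


def run_with_minidumps(command):
    """Run `command` (a list of arguments) with the dump settings and return its exit code."""
    completed = subprocess.run(command, env=minidump_env(), check=False)
    return completed.returncode
```

A call such as `run_with_minidumps(["dotnet", "test", "path/to/Tests.dll"])` would then launch the tests with these settings; the test assembly path here is a placeholder.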
|
I have got mac devices, is it possible to reproduce these locally? |
GC Regions aren't enabled for OSX? |
We tried running the tests locally on Macs and the issue did not reproduce. Unsure if it's specific to this hardware or some other factor. It reproduces very reliably in CI. |
Yes, it is not enabled for OSX, because of the large dump issue. |
Then maybe the issue can be reproduced on other platforms by disabling regions? |
We certainly can disable region using |
Do we know how to read the dump? When I tried to open the dump in Visual Studio, it complained about a wrong file format. When I tried to open it in WinDBG, it reported 25 threads. The crash report is more interesting; it does indicate
The |
My understanding from chatting with @hoyosjs is that Mac dumps are only openable on Mac.
I queued up a PR that disables this for Linux. That should be fairly apples to apples with our Mac runs. If it does crash, the dump should upload because it doesn't have the same size issues that Macs do. |
Got a complete run on Linux with |
That's sad, which means we have to work with OSX then. Is it possible to capture stress log and/or run with customized |
I see local test crashes as well after I sync & rebuild runtime (after git clean -dfx)
I'm not 100% sure if this is the same issue but I can provide more info if needed. |
@wfurt, can we have the native stack? |
this?
I can reproduce while running under lldb. I'm working on binary search now to isolate breaking change. |
@wfurt, yes, this is what I meant by native stack. This doesn't look like the GC issue we are investigating though; instead, this looks like a managed exception. |
OK. The part I'm trying to understand is why it would crash the whole test process. I tried several test suites and I get the same results. I could try the coreclr tests as well, as Mac is my primary machine. |
Any updates here? This bug is blocking our OSX testing as it results in a near 100% failure rate. |
I wasn't actively working on it for the last few days. I am happy to if I can get some useful information that I can use. |
We've offered to change our repo to give any information possible. We are bound by the limits of Helix so full dumps aren't possible.
This is not very encouraging. There appears to be a runtime bug that blocks the compiler from executing correctly on one of our key operating systems. |
Is it possible to run the CI with an instrumented runtime that emits logs instead of dump files, to work around the limitation that dumps are unavailable? |
The OSX queues are failing at virtually 100% right now. Shutting them down until .NET runtime makes progress on the issue. dotnet/runtime#97186
Yes. Is there a runtime available that is instrumented that we can use + instructions on how to use it? |
Thank you! It looks like the run is completed, but the only failure is "Validate Generated Syntax Files" failing with a "The system cannot find the path specified." on Windows? |
@cshung Yes. I've re-run it a few times and it's passing every time. Worried that whatever timing issue exists is essentially undone by adding printf into the code. Trying a few more runs. |
Is there some way we can observe that the new GC is actually in use in the pipeline? |
I don't know ... can we? 😄
The log is not easy to find when tests pass. Have to start digging through the Helix API directly to find it. |
Tracked down the console log and it does seem like something is going sideways here. Our mac logic which was working one week ago is now no longer hitting. Yay infra. Digging into it. |
Just a link for easy access. |
can this be closed now since this hasn't reproed in a while? |
Yeah. Can re-open if we see it again. |
I think this happens regularly in roslyn-CI main runs - all these red circles are failing macOS legs: |
@cshung, do you still have the private with logging which you had built to diagnose these? |
Yes, the instrumentation branch is still there |
Is there any update on the investigation here? Have we been able to run with instrumentation enabled? |
This actually stopped happening when roslyn was updated from .NET 8 to .NET 9 SDK in dotnet/roslyn#73408. |
so is it ok to close for now and reopen if it occurs again? |
Does that mean we have an 8.0 bug still? |
100% yes. |
@mangod9 can someone on your team look at this? I think this might meet the 8.0 servicing bar since it's LTS |
I think the challenge has been the lack of dumps from macOS, hence @cshung had added some logging to diagnose, but if the issue is not reproing we can't investigate further (unless there is a way to move the toolset back to 8 and try to repro) |
It seems like you could manually queue a run with that change. |
I am more than happy to try to figure out what is wrong here. What I can do:
What I don't know how to:
@jaredpar and I used to have a branch here where I can update and push a I don't know if it is feasible, but it would be great to be able to freeze the Roslyn code and the dotnet used for running so that the bug will hopefully not suddenly go away by itself. |
Description
The runtime is crashing on OSX when running the compiler unit tests. There is no specific test that causes the crash but the crash is very consistent (approaching 100%).
Unfortunately, due to this being an OSX dump, the dump files are too large for us to upload a full dump. The best we can get is a mini dump and json report:
This is blocking our ability to test on OSX because there is no specific test to disable to work around the crash that we can see. The same tests pass just fine on Windows and Linux.
Reproduction Steps
Run the C# compiler tests in CI. Any build in this pipeline with the main branch filter will demonstrate the crash: https://dnceng-public.visualstudio.com/public/_build?definitionId=95&_a=summary&branchFilter=319%2C319
Expected behavior
Unit tests pass
Actual behavior
Crashes with the following stack trace:
Regression?
Yes this is a recent regression that came around the time of adopting the .NET 8 SDK GA
Known Workarounds
None
Configuration
Complicated
Other information
At the moment we are collecting OSX dumps with the following settings:
Happy to adjust this as needed to help with the investigation.