FreeBSD tests crash with HRESULT: 0x8007FF02 from #106724 #106892

Closed
Thefrank opened this issue Aug 23, 2024 · 16 comments · Fixed by #107100
Labels
area-VM-coreclr, in-pr (There is an active PR which will close this issue when it is merged), os-freebsd (FreeBSD OS)
Milestone

Comments

@Thefrank
Contributor

Description

From #106724 onward, all tests on FreeBSD crash with HRESULT: 0x8007FF02

These crashes do not show up in /var/log/messages on either the jail or the host. dmesg is also free of errors from this. No SIGXXXX signals are reported.
Builds otherwise complete without issue.

Reproduction Steps

As initially discovered:
./build.sh -ci -c Release -subset Clr+Mono+Host.Native+Host.Tools+Host.Pkg+Libs+Libs.Tests+Packs --test
Double checked with (requires #106302):
src/tests/build.sh Release /p:LibrariesConfiguration=Release -rebuild -runtests

Expected behavior

Tests run and either pass or fail

Actual behavior

"exit code 137 means SIGKILL Killed either due to out of memory/resources (see /var/log/messages) or by explicit kill."

or

"Failed to create CoreCLR, HRESULT: 0x8007FF02"

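For reference, the HRESULT in the error message can be split into its standard bit fields. The sketch below only applies the generic Windows HRESULT layout (the same shift and mask as the `HRESULT_FACILITY` macro); the helper name `decode_hresult` is made up for illustration, and nothing here says what the code 0xFF02 means inside the runtime.

```cpp
#include <cstdint>

// Fields of a 32-bit HRESULT, using the standard Windows layout:
// bit 31 is the severity (1 = failure), the facility sits above bit 16,
// and the low 16 bits are the error code.
struct HResultFields {
    bool failure;
    std::uint32_t facility;
    std::uint32_t code;
};

// Hypothetical helper: splits an HRESULT into its fields.
// The 0x1FFF facility mask matches the Windows HRESULT_FACILITY macro.
constexpr HResultFields decode_hresult(std::uint32_t hr) {
    return HResultFields{
        (hr >> 31) != 0,
        (hr >> 16) & 0x1FFFu,
        hr & 0xFFFFu,
    };
}
```

For 0x8007FF02 this gives a failure with facility 7 (FACILITY_WIN32) and code 0xFF02.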
Regression?

Yes

Known Workarounds

None?

Configuration

FreeBSD-x64 both 13.3 and 14.1
System has about 200G of free RAM during this. Jails run without resource quotas.

Other information

Public log for both FreeBSD 13.3 and 14.1: https://dev.azure.com/IFailAt/freebsd-dotnet-runtime-nightly/_build/results?buildId=1625&view=results
BinLogs are published to artifacts.

commit listed is 6df7807 (after PR listed)

git bisect log

git bisect start
# status: waiting for both good and bad commits
# good: [21fe57174af159f7bcb39f591cd9fbb96a36bfcb] COMObject invoke through reflection (#106677)
git bisect good 21fe57174af159f7bcb39f591cd9fbb96a36bfcb
# status: waiting for bad commit, 1 good commit known
# bad: [6df7807617cff3e4b4eba7bb33c8ad00c2b1cf1c] Add missing .alt_entry to CoreCLR *_FakeProlog methods (#106744)
git bisect bad 6df7807617cff3e4b4eba7bb33c8ad00c2b1cf1c
# good: [a3cbf8c2534040faa4664e45c19a0ba877bb5c5c] Grab back some performance in `RuntimeTypeHandle::GetElementType` (#106730)
git bisect good a3cbf8c2534040faa4664e45c19a0ba877bb5c5c
# good: [4e8892ea65fe85593b1853907d965fb49d324ce9] JIT: Add Non-Faulting behaviour for Sve.LoadVector*NonFaulting*() (#106648)
git bisect good 4e8892ea65fe85593b1853907d965fb49d324ce9
# good: [cfa0cc5b0282c29597f10b6c2fdceab159e0be81] Inject IJW Copy Constructor calls in the JIT instead of in the IL stubs (#106424)
git bisect good cfa0cc5b0282c29597f10b6c2fdceab159e0be81
# good: [cfa8da560f674d035a26e9b3b07067933d46e44c] Avoid signed overflow in DBG_FlushInstructionCache (#105918)
git bisect good cfa8da560f674d035a26e9b3b07067933d46e44c
# bad: [92e41b6b2cf7aea845fcfc9a439dc4e191c07428] Add callback-based Encode to AsnWriter
git bisect bad 92e41b6b2cf7aea845fcfc9a439dc4e191c07428
# bad: [27ee590d1f96dc590f8ad85e76b7b8a6ad000b2d] pal init: InitializeFlushProcessWriteBuffers() before first thread to improve start time (#106724)
git bisect bad 27ee590d1f96dc590f8ad85e76b7b8a6ad000b2d
# first bad commit: [27ee590d1f96dc590f8ad85e76b7b8a6ad000b2d] pal init: InitializeFlushProcessWriteBuffers() before first thread to improve start time (#106724)
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 23, 2024
@harisokanovic
Contributor

@Thefrank , can you recommend a guide for building the v9 SDK on FreeBSD? https://github.com/dotnet/dotnet/ fails with ./prep-source-build.sh: line 199: /dotnet-dotnet/.dotnet/dotnet: No such file or directory.

@Thefrank
Contributor Author

Building the SDK requires bootstrapping from Linux or using an old SDK to build the new one. If you just want a prebuilt SDK, we have those too.

Building:
Bootstrapping will require: #105004 #105587 and optionally #105587. Native builds only need #105587. If you want a collection of scripts for native building, there is this

I have not tried to build from VMR for net9 as the VMR tends to require more work than "bootstrap from tags" or "use old SDK to build new SDK" :(

Downloading:
Both I and @sec have net9p7 SDKs prebuilt: Here or Here

The former repo is what I use for my builds; the latter by @sec also has ARM64 versions and will work just as well for either AMD64 or ARM64. I cherry-pick missing/needed PRs into my SDKs (e.g., the ones listed above). If you want an SDK with 0 changes, @sec's are better.

Please let me know how else I can help with this!

@JulieLeeMSFT JulieLeeMSFT added os-freebsd FreeBSD OS and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 23, 2024
@JulieLeeMSFT
Member

Not a codegen issue. Removing codegen label.

@JulieLeeMSFT JulieLeeMSFT added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Aug 23, 2024
@jkotas jkotas added area-VM-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Aug 24, 2024
@cperciva

FreeBSD dev here. Just to confirm, this is running in native FreeBSD, not via the Linux emulation layer?

@cperciva

Does this process ever call execve? The "process has called MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED" flag gets cleared when we execve and that would result in MEMBARRIER_CMD_PRIVATE_EXPEDITED returning EPERM later.
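The behaviour described here can be pictured with a toy state model. This is only an illustration of the stated rule (registration is a per-process flag, it is cleared by execve, and the expedited command fails with EPERM without it); `ProcessModel` and its members are hypothetical names and no real system calls are made.

```cpp
#include <cerrno>

// Toy model only: tracks the "has registered for private expedited
// membarrier" flag of a single process.
struct ProcessModel {
    bool registeredPrivateExpedited = false;

    // Models MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED: sets the flag.
    int registerPrivateExpedited() {
        registeredPrivateExpedited = true;
        return 0;
    }

    // Models MEMBARRIER_CMD_PRIVATE_EXPEDITED: returns 0 on success,
    // or EPERM when the process has not registered.
    int privateExpedited() const {
        return registeredPrivateExpedited ? 0 : EPERM;
    }

    // Models execve(): the new program image starts unregistered.
    void execve() {
        registeredPrivateExpedited = false;
    }
};
```

In this model, a process that registers, calls execve, and then issues the expedited command without registering again gets EPERM.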

@Thefrank
Contributor Author

It is running on native FreeBSD (no linux kernel modules, linux service not running) on both host and jail. allow.mlock=1 is set for the jail where the build system is located.

@cperciva

@Thefrank And does it call execve at any point?

@Thefrank
Contributor Author

@cperciva Yes. I have attached a truss.txt (truss -fadeH -o truss.txt /root/runtime/artifacts/log/repro/ilasm/PortablePdb/IlasmPortablePdbTests/repro_IlasmPortablePdbTests.sh). Test selected at random.
truss.txt

@cperciva

@Thefrank Is that one of the tests which is crashing? I don't see any process exiting with code 137.

@Thefrank
Contributor Author

Thefrank commented Aug 27, 2024

@cperciva All but seven of the tests are failing with this HRESULT. The truss output is for one of the failing tests. There is nothing SIGXXXX being thrown. /var/log/messages has nothing from the tests or build, just entries related to the jail starting. dmesg on the host has nothing for these tests. This HRESULT is usually one of two things: OOM or, in the case of FreeBSD jails, allow.mlock not being on.
Normally when a test crashes I will at least see something in dmesg like this:

pid 29217 (dotnet), jid 11, uid 0: exited on signal 6 (core dumped)
pid 4517 (dotnet), jid 11, uid 0: exited on signal 6 (core dumped)
pid 42537 (dotnet), jid 11, uid 0: exited on signal 6 (core dumped)

I have attached the repro log which is a bit easier to read than the raw output from the build
repro.txt

edit: A bit of a follow up, the repro.txt does not show exit 137 either. It does show up in the build log (e.g., https://dev.azure.com/IFailAt/freebsd-dotnet-runtime-nightly/_build/results?buildId=1625&view=logs&j=7a7d9a85-6408-51cd-9969-a6e11cb53059&t=61d9617a-4ab8-5d55-8d34-1c6cdb462292&l=7503) after a "Failed to create CoreCLR, HRESULT: 0x8007FF02". I guess the test runner is seeing that HRESULT as a SIGKILL/OOM?
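The "137 means SIGKILL" message comes from the common shell convention that a process killed by signal N is reported with exit code 128 + N. A minimal sketch of that arithmetic, with a hypothetical helper name, is shown below; it does not explain why the test runner reports 137 here, it only shows how the number maps to a signal.

```cpp
#include <csignal>

// Hypothetical helper: applies the shell convention "exit code = 128 + signal".
// Returns the implied signal number, or 0 if the exit code is outside
// the range that convention uses.
int signal_from_exit_code(int exitCode) {
    return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : 0;
}
```

With this convention 137 maps to 9 (SIGKILL), and the "exited on signal 6" crashes quoted from dmesg would normally surface as exit code 134 (SIGABRT).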

@harisokanovic
Contributor

harisokanovic commented Aug 28, 2024

The dotnet process appears to be exiting after this mmap() failure. Can someone run this in gdb and get a call stack? I still cannot build on FreeBSD.

49096 102527: 0.089291910 mmap(0x0,0,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) ERR#22 'Invalid argument'
49096 102527: 0.089427113 write(2,"BEGIN: coreclr_initialize failed"...,53) = 53 (0x35)
49096 102527: 0.089511925 fstat(1,{ mode=crw--w---- ,inode=336,size=0,blksize=4096 }) = 0 (0x0)
49096 102527: 0.089600958 ioctl(1,TIOCGETA,0x3d975150a894) = 0 (0x0)
49096 102527: 0.089693696 write(1,"Exe path: /root/runtime/artifact"...,92) = 92 (0x5c)
49096 102527: 0.089779735 write(1,"Properties:\n",12) = 12 (0xc)
49096 102527: 0.089956849 write(1,"    TRUSTED_PLATFORM_ASSEMBLIES "...,4096) = 4096 (0x1000)
49096 102527: 0.090115491 write(1,"clr/freebsd.x64.Release/Tests/Co"...,4096) = 4096 (0x1000)
49096 102527: 0.090255647 write(1,"System.Management.dll:/root/runt"...,4096) = 4096 (0x1000)
49096 102527: 0.090394600 write(1,"ease/Tests/Core_Root/System.Comp"...,4096) = 4096 (0x1000)
49096 102527: 0.090531941 write(1,"dll:/root/runtime/artifacts/test"...,4096) = 4096 (0x1000)
49096 102527: 0.090698105 write(1,"Collections.Concurrent.dll:/root"...,4096) = 4096 (0x1000)
49096 102527: 0.090841009 write(1,"em.Console.dll:/root/runtime/art"...,4096) = 4096 (0x1000)
49096 102527: 0.090927921 write(1,"lease/Tests/Core_Root/System.Con"...,938) = 938 (0x3aa)
49096 102527: 0.091013966 write(1,"    APP_PATHS = /root/runtime/ar"...,115) = 115 (0x73)
49096 102527: 0.091099685 write(1,"    NATIVE_DLL_SEARCH_DIRECTORIE"...,211) = 211 (0xd3)
49096 102527: 0.091184257 write(1,"    System.Reflection.Metadata.M"...,67) = 67 (0x43)
49096 102527: 0.091295183 write(1,"    System.Runtime.Serialization"...,81) = 81 (0x51)
49096 102527: 0.091381122 write(1,"    HOST_RUNTIME_CONTRACT = 0x3d"...,43) = 43 (0x2b)
49096 102527: 0.091473361 write(1,"Managed assembly: /root/runtime/"...,142) = 142 (0x8e)
49096 102527: 0.091564080 write(1,"Arguments (0): \n",16) = 16 (0x10)
49096 102527: 0.091649785 write(2,"END: coreclr_initialize failed -"...,51) = 51 (0x33)
49096 102527: 0.092297198 exit(0xffffffff)
49096 102527: 0.092347537 process exit, rval = 4294967295
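The first line of the trace shows mmap() being called with a length of 0 and failing with error 22 (EINVAL), which is what POSIX specifies for a zero-length mapping. A small sketch that reproduces just that call is below; the function name is made up for illustration and the code is not taken from the runtime.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper: attempts an anonymous private mapping of `length`
// bytes with the same protection and flags as the traced call.
// Returns 0 on success (the mapping is released again), otherwise the
// errno value reported by mmap().
int try_anonymous_mmap(std::size_t length) {
    void* p = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) {
        return errno;
    }
    munmap(p, length);
    return 0;
}
```

Calling `try_anonymous_mmap(0)` returns EINVAL, while a non-zero length such as 4096 succeeds.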

@harisokanovic
Contributor

Also, are these related to the following known issues? #100558 and dotnet/dnceng#2496

@jkotas
Member

jkotas commented Aug 28, 2024

> Can someone run this in gdb and get a call stack?

It is going to be the call here:

s_helperPage = static_cast<int*>(mmap(0, GetVirtualPageSize(), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0));

The problem is that the value returned by GetVirtualPageSize is not initialized yet, now that this method is called earlier. We need to initialize it earlier.
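The ordering problem can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the PAL code: the `sketch` namespace, `g_cachedPageSize`, `InitializeVirtual`, `GetCachedPageSize`, and `AllocateHelperPage` are hypothetical stand-ins for a page size that is cached during initialization and an allocation that reads the cache.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

namespace sketch {

// Hypothetical stand-in for a cached page size; it stays 0 until
// InitializeVirtual() has run.
static std::size_t g_cachedPageSize = 0;

// Hypothetical stand-in for the initialization step that caches the page size.
void InitializeVirtual() {
    g_cachedPageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
}

std::size_t GetCachedPageSize() {
    return g_cachedPageSize;
}

// Maps one page using the cached size, like the helper page allocation.
// Returns 0 on success (the page is released again), otherwise the errno
// value reported by mmap().
int AllocateHelperPage() {
    void* page = mmap(nullptr, GetCachedPageSize(), PROT_READ | PROT_WRITE,
                      MAP_ANON | MAP_PRIVATE, -1, 0);
    if (page == MAP_FAILED) {
        return errno;
    }
    munmap(page, GetCachedPageSize());
    return 0;
}

} // namespace sketch
```

Calling `sketch::AllocateHelperPage()` before `sketch::InitializeVirtual()` asks for zero bytes and fails with EINVAL; calling it afterwards succeeds.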

A related problem is why MEMBARRIER_CMD_QUERY does not kick in on FreeBSD. I believe that the MEMBARRIER syscall should be implemented on FreeBSD: https://www.freebsd.org/status/report-2021-10-2021-12/membarrier-rseq/, but it is not working for some reason. We end up using the fallback path that has a bug.

> are these related to the following known issues?

These are unrelated.

harisokanovic pushed a commit to harisokanovic/dotnet_runtime that referenced this issue Aug 28, 2024
…ialize()

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes dotnet#106892
Fixes dotnet#106722
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Aug 28, 2024
@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Aug 28, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Aug 28, 2024
janvorli pushed a commit that referenced this issue Aug 28, 2024
…ialize() (#107100)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes #106892
Fixes #106722

Co-authored-by: Haris Okanovic <[email protected]>
github-actions bot pushed a commit that referenced this issue Aug 28, 2024
…ialize()

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes #106892
Fixes #106722
@jkotas
Member

jkotas commented Aug 28, 2024

@Thefrank The fix is in. Could you please verify that FreeBSD is functional again?

Also, it would be useful to find out why the MEMBARRIER_CMD_QUERY path does not work on FreeBSD. MEMBARRIER_CMD_QUERY is important for Arm64. If this path does not work on Arm64, the runtime will have reliability issues. I would not be opposed to disabling the fallback path for Arm64, so that we do not have to chase the intermittent Arm64 crashes caused by it.
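As a rough illustration of the decision being discussed, the sketch below picks between the membarrier path and the fallback path from the value a MEMBARRIER_CMD_QUERY call returns (a negative value on error, otherwise a bitmask of supported commands). The constant values are the ones from Linux's `<linux/membarrier.h>`; that FreeBSD uses the same values, and that the runtime checks exactly this bit, are assumptions. The names `FlushStrategy` and `choose_flush_strategy` are hypothetical.

```cpp
// Command bits as defined by Linux's <linux/membarrier.h>.
// Assumption: the platform in question reports the same bit values.
constexpr int kCmdPrivateExpedited = 1 << 3;
constexpr int kCmdRegisterPrivateExpedited = 1 << 4;

enum class FlushStrategy { MemBarrier, Fallback };

// Hypothetical helper: decides which path to use from the query result.
// queryMask < 0 models a failed query (for example, the syscall is missing).
FlushStrategy choose_flush_strategy(int queryMask) {
    if (queryMask >= 0 && (queryMask & kCmdPrivateExpedited) != 0) {
        return FlushStrategy::MemBarrier;
    }
    return FlushStrategy::Fallback;
}
```

A real implementation would still need to issue the registration command and fall back if that call fails; the sketch covers only the query step.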

jkotas pushed a commit that referenced this issue Aug 29, 2024
…ialize() (#107114)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes #106892
Fixes #106722

Co-authored-by: Haris Okanovic <[email protected]>
@Thefrank
Contributor Author

@jkotas Looks good on my end!

I don't have an ARM64 hardware setup to test on but @sec might have some insight on FreeBSD-ARM64 with .NET.

Under FreeBSD-x64, I have not noticed any consistent errors*

Not sure what to do about the MEMBARRIER_CMD_QUERY issue either. There is no manpage for either membarrier(2) or rseq(2) on FreeBSD and source/differential might be a pain to dig through. If reliability is a concern then it sounds like disabling the fallback path might be for the best.

*I have noticed an inconsistent error when restore is run during the initial build of runtime. I will open an issue about this when I can better reproduce it.

@sec
Contributor

sec commented Aug 29, 2024

I will try to find some time and do the checks/tests on arm64.

edit: fix looks fine also on arm64.

jtschuster pushed a commit to jtschuster/runtime that referenced this issue Sep 17, 2024
…ialize() (dotnet#107100)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes dotnet#106892
Fixes dotnet#106722

Co-authored-by: Haris Okanovic <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2024
7 participants