-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8362193: Re-work MacOS/AArch64 SpinPause to handle SB #26387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…in-wait insns range checks
👋 Welcome back eastigeevich! A progress list of the required criteria for merging this PR into |
@eastig This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 48 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think it is not very reasonable to pull things into shared parts, when that deviates significantly from the common Hotspot way of handling things. I don't think it even saves us any duplication to move implementations to spin_wait_aarch64.(inline.)pp
? So I suggest we don't.
* @requires vm.flagless | ||
* @requires os.arch=="aarch64" | ||
* @requires vm.cpu.features ~= ".*sb.*" | ||
* @run main/native GTestWrapper --gtest_filter=SpinPause* -XX:+UnlockDiagnosticVMOptions -XX:OnSpinWaitInst=sb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This configuration is for @requires vm.cpu.features ~= ".*sb.*"
, correct?
If so, you can put this entire @test
block in the TestSpinPauseAArch64.java
, then jtreg would just execute them as subtests. Here is an inspiration: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/gc/shenandoah/TestAllocObjects.java
* questions. | ||
*/ | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** | |
/* |
@@ -114,10 +115,11 @@ define_pd_global(intx, InlineSmallCode, 1000); | |||
"Use prfm hint with specified distance in compiled code." \ | |||
"Value -1 means off.") \ | |||
range(-1, 4096) \ | |||
product(ccstr, OnSpinWaitInst, "yield", DIAGNOSTIC, \ | |||
product(ccstr, OnSpinWaitInst, DEFAULT_SPIN_WAIT_INST, DIAGNOSTIC, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is somewhat weird to introduce references to macros here. At some point, this might lead to interesting include circularities, as I think globals.*
are supposed to be at the leaves of include trees. Other options do not do this, see for example UseBranchProtection
below. So, keep this part intact?
Do I understand you correctly we should not have |
Yes. I don't see a benefit for them? Collecting everything related to spin wait implementation is not worth it, IMO. We have duplication by asm-stubs, macroassembler, c1_assembler, c2.ad rules for the arch-specific code already. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More reviews.
I think the PR drifted from my initial "oh, just assert range problem" to "Re-work MacOS/AArch64 spin-waits to handle SB". Rewrite the JBS/PR title accordingly?
*/ | ||
|
||
/* | ||
* @test TestSpinPauseAArch64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A common way to tag the tests:
@test id=default
I assume all these are supported in ARMv8.0 (default? baseline?) profile. That's why I put default
. Put something else if that is incorrect.
Same for the second @test
block.
return SpinWait(SpinWait::SB, OnSpinWaitInstCount); | ||
} else if (strcmp(OnSpinWaitInst, "none") != 0) { | ||
vm_exit_during_initialization("The options for OnSpinWaitInst are nop, isb, yield, sb, and none", OnSpinWaitInst); | ||
if (!SpinWait::supports(OnSpinWaitInst)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is not about the actual spin-wait hints support, correct? This only checks the option string is in expected domain? A common way to deal with this is to use constraint functions, see for example:
product(ccstr, AOTCache, nullptr, \
"Cache for improving start up and warm up") \
constraint(AOTCacheConstraintFunc, AtParse) \
\
assert(inst_id == 0 || is_power_of_2(inst_id), "Values of SpinWait::Inst must be 0 or power of 2"); | ||
assert(inst_id != SpinWait::SB || VM_Version::supports_sb(), "current CPU does not support SB instruction"); | ||
if (inst_id > SpinWait::NOP) { | ||
warining("Unsupported type of SpinWait::Inst: %d", inst_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning
Also, I think we want to minimize the amount of code we actually execute in SpinPause
, since it likely sits in the hot loop. So checking this here is probably counter-productive.
Webrevs
|
" nop \n" | ||
"4: \n" | ||
: | ||
: [id]"r"(inst_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no good reason to handle all this logic in asm. Please use a switch statement instead, and we can also get rid of most of the assertions by adding a ShouldNotReachHere() in the default clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I proposed to use the switch when JDK-8321371 was being reviewed: #16994 (comment)
Frederick (@fbredber) wanted to avoid branches: #16994 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The switch-based version is committed: e984fde
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so this inline assembly was to optimize SpinPause
, since it sits in hot loop. Have you looked at disassembly for SpinPause
before/after? On my M1, I see this:
% lldb -o "disassemble -n SpinPause" -o quit -- build/macosx-aarch64-server-release/images/jdk/lib/server/libjvm.dylib
# Before
libjvm.dylib`::SpinPause():
libjvm.dylib[0x89f0d4] <+0>: stp x29, x30, [sp, #-0x10]!
libjvm.dylib[0x89f0d8] <+4>: mov x29, sp
libjvm.dylib[0x89f0dc] <+8>: adrp x8, 1409
libjvm.dylib[0x89f0e0] <+12>: add x8, x8, #0x80 ; VM_Version::_spin_wait
libjvm.dylib[0x89f0e4] <+16>: ldrsw x8, [x8]
libjvm.dylib[0x89f0e8] <+20>: lsl x8, x8, #3
libjvm.dylib[0x89f0ec] <+24>: adr x9, 0x89f100 ; <+44> at os_bsd_aarch64.cpp:545:5
libjvm.dylib[0x89f0f0] <+28>: add x9, x9, x8
libjvm.dylib[0x89f0f4] <+32>: br x9
libjvm.dylib[0x89f0f8] <+36>: b 0x89f114 ; <+64> at os_bsd_aarch64.cpp:561:5
libjvm.dylib[0x89f0fc] <+40>: nop
libjvm.dylib[0x89f100] <+44>: nop
libjvm.dylib[0x89f104] <+48>: b 0x89f114 ; <+64> at os_bsd_aarch64.cpp:561:5
libjvm.dylib[0x89f108] <+52>: isb
libjvm.dylib[0x89f10c] <+56>: b 0x89f114 ; <+64> at os_bsd_aarch64.cpp:561:5
libjvm.dylib[0x89f110] <+60>: yield
libjvm.dylib[0x89f114] <+64>: mov w0, #0x1 ; =1
libjvm.dylib[0x89f118] <+68>: ldp x29, x30, [sp], #0x10
libjvm.dylib[0x89f11c] <+72>: ret
# After
libjvm.dylib`::SpinPause():
libjvm.dylib[0x89f074] <+0>: stp x29, x30, [sp, #-0x10]!
libjvm.dylib[0x89f078] <+4>: mov x29, sp
libjvm.dylib[0x89f07c] <+8>: adrp x8, 1409
libjvm.dylib[0x89f080] <+12>: add x8, x8, #0x80 ; VM_Version::_spin_wait
libjvm.dylib[0x89f084] <+16>: ldr w8, [x8]
libjvm.dylib[0x89f088] <+20>: add w8, w8, #0x1
libjvm.dylib[0x89f08c] <+24>: cmp w8, #0x4
libjvm.dylib[0x89f090] <+28>: b.hi 0x89f0ec ; <+120> at os_bsd_aarch64.cpp:551:7
libjvm.dylib[0x89f094] <+32>: adrp x9, 0
libjvm.dylib[0x89f098] <+36>: add x9, x9, #0xfc ; ___lldb_unnamed_symbol66913
libjvm.dylib[0x89f09c] <+40>: adr x10, 0x89f09c ; <+40> at os_bsd_aarch64.cpp
libjvm.dylib[0x89f0a0] <+44>: ldrsw x11, [x9, x8, lsl #2]
libjvm.dylib[0x89f0a4] <+48>: add x10, x10, x11
libjvm.dylib[0x89f0a8] <+52>: br x10
libjvm.dylib[0x89f0ac] <+56>: nop
libjvm.dylib[0x89f0b0] <+60>: mov w0, #0x1 ; =1
libjvm.dylib[0x89f0b4] <+64>: ldp x29, x30, [sp], #0x10
libjvm.dylib[0x89f0b8] <+68>: ret
libjvm.dylib[0x89f0bc] <+72>: isb
libjvm.dylib[0x89f0c0] <+76>: mov w0, #0x1 ; =1
libjvm.dylib[0x89f0c4] <+80>: ldp x29, x30, [sp], #0x10
libjvm.dylib[0x89f0c8] <+84>: ret
libjvm.dylib[0x89f0cc] <+88>: yield
libjvm.dylib[0x89f0d0] <+92>: mov w0, #0x1 ; =1
libjvm.dylib[0x89f0d4] <+96>: ldp x29, x30, [sp], #0x10
libjvm.dylib[0x89f0d8] <+100>: ret
libjvm.dylib[0x89f0dc] <+104>: sb
libjvm.dylib[0x89f0e0] <+108>: mov w0, #0x1 ; =1
libjvm.dylib[0x89f0e4] <+112>: ldp x29, x30, [sp], #0x10
libjvm.dylib[0x89f0e8] <+116>: ret
libjvm.dylib[0x89f0ec] <+120>: adrp x0, 1063
libjvm.dylib[0x89f0f0] <+124>: add x0, x0, #0xe2a ; "src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp"
libjvm.dylib[0x89f0f4] <+128>: mov w1, #0x227 ; =551
libjvm.dylib[0x89f0f8] <+132>: bl 0x311f84 ; report_should_not_reach_here at debug.cpp:247
So I think switch is fairly well compiled. On first glance, it generates more code by duplicating the epilog for every case, but I think that is a bit cleaner than trying to do branch-overs. It generates marginally better code if you place case
in enum order, and do should_not_reach
here branch only for debug builds:
#ifdef ASSERT
default:
ShouldNotReachHere();
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. When writing inline asm, it doesn't much help to try to out-guess the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my hand written assembly with the current default YIELD: https://godbolt.org/z/3YjxzW4sW
SpinPause(SpinWait::Inst):
mov w8, w0
tbz w8, #0, .Ltmp0
yield
b .Ltmp1
.Ltmp0:
.Ltmp1:
ret
Iterations: 100
Instructions: 500
Total Cycles: 152
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original assembly with the current default YIELD: https://godbolt.org/z/McWGM4f44
SpinPause(SpinWait::Inst):
lsl w8, w0, #3
sxtw x8, w8
adr x9, #20
add x9, x9, x8
br x9
b .Ltmp1
nop
nop
b .Ltmp1
nop
b .Ltmp1
yield
.Ltmp1:
ret
Iterations: 100
Instructions: 1300
Total Cycles: 254
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the default YIELD we have:
- Original: ~250 clocks
- Compiler binary search tree: ~200 clocks (-20%)
- Optimized with
TBZ
: ~150 clocks (-40%)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the differences might be smaller on the real hardware. Maybe everything will be around 200 clocks.
If we need code easy to maintain then this is switch
. If we need performance then this is tbz
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I did quite a lot of performance measurements before I settled on the assembler solution. Have you made any comparison before and after changing from the assembler code to the new c++ code? If so what tests did you run? Since the code is called in tight locking loops, this code really matters.
Given that this routine is a backoff delay, it's not clear that speeding it up helps.
However, if we really wanted to speed this up we'd use an indirect branch to one of four code blocks: isb; ret
would be one of them. But I don't think we do want to speed it up.
Do you not think we should implement the Arm-recommended solution in https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm? |
When we get armv8.7 hardware, I'd like to try their solution. |
OK, that makes sense.
I don't think that matters. We should experiment when we have hardware to find a sensible default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right then, looks good to me.
/reviewers 2
Background
With JDK-8359435 "AArch64: add support for SB instruction to MacroAssembler::spin_wait" we have an option to use the speculation barrier (SB) instruction for
j.l.Thread::onSpinWait
andSpinPause
. On Linux AArch64SpinPause
uses a stub which is generated withMacroAssembler::spin_wait
.j.l.Thread::onSpinWait
uses it as well. As a result tests forj.l.Thread::onSpinWait
, e.g.compiler/onSpinWait/TestOnSpinWaitAArch64.java
, cover bothj.l.Thread::onSpinWait
andSpinPause
. Also we don't need to updateSpinPause
when we add support for a new instruction toMacroAssembler::spin_wait
.On Mac AArch64
SpinPause
does not use the generated stub to avoid a costly call toos::current_thread_enable_wx()
. It uses inline assembly instead. As a result we haveSpinWait
implementation details leaking intoSpinPause
. Besides asserts there are no tests covering Mac implementation ofSpinPause
. Testing on Apple M3 Pro showed thatcompiler/onSpinWait/TestOnSpinWaitAArch64.java
could not reliably trigger those asserts.ArchiveWorkers::run_task_multi
did not invokespin.wait()
. The inline assembly in Mac AArch64SpinPause
has another issue. It uses a jump table. An offset in the table is calculated based on the value ofenum SpinWait::Inst
. When a new valueSB
was added the offset became out of bounds. The jump goes out of the assembly code.Summary of changes
SpinPause
not to use a jump table.SpinWait::supports()
andSpinWait::from_name()
methods to validate and convert instruction names into corresponding enum values.SpinWait::Inst
enum to use bit flags for simplified assembly testing and added a constructor to initializeSpinWait
using instruction names.src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp
: Added an assertion to ensure thesb
instruction is supported by the CPU before generation.OnSpinWaitInstNameConstraintFunc
for theOnSpinWaitInst
option to check option's value at parsing.SpinPause
.SpinPause
with various instructions, including thesb
instruction if supported by the CPU.Testing results: fastdebug, release
test/hotspot/jtreg/gtest/TestSpinPauseAArch64.java
: Passedtest/hotspot/jtreg/compiler/onSpinWait
: Passedtest/hotspot/jtreg/gtest/TestSpinPauseAArch64.java
: Passedtest/hotspot/jtreg/compiler/onSpinWait
: PassedProgress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26387/head:pull/26387
$ git checkout pull/26387
Update a local copy of the PR:
$ git checkout pull/26387
$ git pull https://git.openjdk.org/jdk.git pull/26387/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26387
View PR using the GUI difftool:
$ git pr show -t 26387
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26387.diff
Using Webrev
Link to Webrev Comment