Skip to content

8362193: Re-work MacOS/AArch64 SpinPause to handle SB #26387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

eastig
Copy link
Member

@eastig eastig commented Jul 18, 2025

Background

With JDK-8359435 "AArch64: add support for SB instruction to MacroAssembler::spin_wait" we have an option to use the speculation barrier (SB) instruction for j.l.Thread::onSpinWait and SpinPause. On Linux AArch64 SpinPause uses a stub which is generated with MacroAssembler::spin_wait. j.l.Thread::onSpinWait uses it as well. As a result tests for j.l.Thread::onSpinWait, e.g.compiler/onSpinWait/TestOnSpinWaitAArch64.java, cover both j.l.Thread::onSpinWait and SpinPause. Also we don't need to update SpinPause when we add support for a new instruction to MacroAssembler::spin_wait.

On Mac AArch64 SpinPause does not use the generated stub to avoid a costly call to os::current_thread_enable_wx(). It uses inline assembly instead. As a result we have SpinWait implementation details leaking into SpinPause. Besides asserts there are no tests covering Mac implementation of SpinPause. Testing on Apple M3 Pro showed that compiler/onSpinWait/TestOnSpinWaitAArch64.java could not reliably trigger those asserts. ArchiveWorkers::run_task_multi did not invoke spin.wait(). The inline assembly in Mac AArch64 SpinPause has another issue. It uses a jump table. An offset in the table is calculated based on the value of enum SpinWait::Inst. When a new value SB was added the offset became out of bounds. The jump goes out of the assembly code.

Summary of changes

  • Rewrote Mac AArch64 SpinPause not to use a jump table.
  • Added SpinWait::supports() and SpinWait::from_name() methods to validate and convert instruction names into corresponding enum values.
  • Updated SpinWait::Inst enum to use bit flags for simplified assembly testing and added a constructor to initialize SpinWait using instruction names.
  • src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp: Added an assertion to ensure the sb instruction is supported by the CPU before generation.
  • Added OnSpinWaitInstNameConstraintFunc for the OnSpinWaitInst option to check option's value at parsing.
  • Added a gtest to verify the functionality of SpinPause.
  • Added a jtreg test to run the gtest forSpinPause with various instructions, including the sb instruction if supported by the CPU.

Testing results: fastdebug, release

  • Linux, Graviton 2
    • tier1: Passed
    • test/hotspot/jtreg/gtest/TestSpinPauseAArch64.java: Passed
    • test/hotspot/jtreg/compiler/onSpinWait: Passed
  • MacOS, M3 Pro
    • tier1, OnSpinWaitInst=default: Passed
    • tier1, OnSpinWaitInst=sb: Passed
    • test/hotspot/jtreg/gtest/TestSpinPauseAArch64.java: Passed
    • test/hotspot/jtreg/compiler/onSpinWait: Passed

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8362193: Re-work MacOS/AArch64 SpinPause to handle SB (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26387/head:pull/26387
$ git checkout pull/26387

Update a local copy of the PR:
$ git checkout pull/26387
$ git pull https://git.openjdk.org/jdk.git pull/26387/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26387

View PR using the GUI difftool:
$ git pr show -t 26387

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26387.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 18, 2025

👋 Welcome back eastigeevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 18, 2025

@eastig This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8362193: Re-work MacOS/AArch64 SpinPause to handle SB

Reviewed-by: shade, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 48 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Jul 18, 2025

@eastig The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think it is not very reasonable to pull things into shared parts, when that deviates significantly from the common Hotspot way of handling things. I don't think it even saves us any duplication to move implementations to spin_wait_aarch64.(inline.)pp? So I suggest we don't.

* @requires vm.flagless
* @requires os.arch=="aarch64"
* @requires vm.cpu.features ~= ".*sb.*"
* @run main/native GTestWrapper --gtest_filter=SpinPause* -XX:+UnlockDiagnosticVMOptions -XX:OnSpinWaitInst=sb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configuration is for @requires vm.cpu.features ~= ".*sb.*", correct?

If so, you can put this entire @test block in the TestSpinPauseAArch64.java, then jtreg would just execute them as subtests. Here is an inspiration: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/gc/shenandoah/TestAllocObjects.java

* questions.
*/

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
/*

@@ -114,10 +115,11 @@ define_pd_global(intx, InlineSmallCode, 1000);
"Use prfm hint with specified distance in compiled code." \
"Value -1 means off.") \
range(-1, 4096) \
product(ccstr, OnSpinWaitInst, "yield", DIAGNOSTIC, \
product(ccstr, OnSpinWaitInst, DEFAULT_SPIN_WAIT_INST, DIAGNOSTIC, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somewhat weird to introduce references to macros here. At some point, this might lead to interesting include circularities, as I think globals.* are supposed to be at the leaves of include trees. Other options do not do this, see for example UseBranchProtection below. So, keep this part intact?

@eastig
Copy link
Member Author

eastig commented Jul 18, 2025

Overall, I think it is not very reasonable to pull things into shared parts, when that deviates significantly from the common Hotspot way of handling things. I don't think it even saves us any duplication to move implementations to spin_wait_aarch64.(inline.)pp? So I suggest we don't.

Do I understand you correctly we should not have exec_spin_wait_inst and generate_spin_wait?

@shipilev
Copy link
Member

shipilev commented Jul 18, 2025

Overall, I think it is not very reasonable to pull things into shared parts, when that deviates significantly from the common Hotspot way of handling things. I don't think it even saves us any duplication to move implementations to spin_wait_aarch64.(inline.)pp? So I suggest we don't.

Do I understand you correctly we should not have exec_spin_wait_inst and generate_spin_wait?

Yes. I don't see a benefit for them? Collecting everything related to spin wait implementation is not worth it, IMO. We have duplication by asm-stubs, macroassembler, c1_assembler, c2.ad rules for the arch-specific code already.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More reviews.

I think the PR drifted from my initial "oh, just assert range problem" to "Re-work MacOS/AArch64 spin-waits to handle SB". Rewrite the JBS/PR title accordingly?

*/

/*
* @test TestSpinPauseAArch64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common way to tag the tests:

@test id=default

I assume all these are supported in ARMv8.0 (default? baseline?) profile. That's why I put default. Put something else if that is incorrect.

Same for the second @test block.

return SpinWait(SpinWait::SB, OnSpinWaitInstCount);
} else if (strcmp(OnSpinWaitInst, "none") != 0) {
vm_exit_during_initialization("The options for OnSpinWaitInst are nop, isb, yield, sb, and none", OnSpinWaitInst);
if (!SpinWait::supports(OnSpinWaitInst)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is not about the actual spin-wait hints support, correct? This only checks the option string is in expected domain? A common way to deal with this is to use constraint functions, see for example:

  product(ccstr, AOTCache, nullptr,                                         \
          "Cache for improving start up and warm up")                       \
          constraint(AOTCacheConstraintFunc, AtParse)                       \
                                                                            \

assert(inst_id == 0 || is_power_of_2(inst_id), "Values of SpinWait::Inst must be 0 or power of 2");
assert(inst_id != SpinWait::SB || VM_Version::supports_sb(), "current CPU does not support SB instruction");
if (inst_id > SpinWait::NOP) {
warining("Unsupported type of SpinWait::Inst: %d", inst_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning

Also, I think we want to minimize the amount of code we actually execute in SpinPause, since it likely sits in the hot loop. So checking this here is probably counter-productive.

@eastig eastig changed the title 8362193: compiler/onSpinWait/TestOnSpinWaitAArch64.java asserts on spin-wait insns range checks 8362193: Re-work MacOS/AArch64 SpinPause to handle SB Jul 18, 2025
@eastig eastig marked this pull request as ready for review July 20, 2025 13:42
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 20, 2025
@mlbridge
Copy link

mlbridge bot commented Jul 20, 2025

Webrevs

" nop \n"
"4: \n"
:
: [id]"r"(inst_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no good reason to handle all this logic in asm. Please use a switch statement instead, and we can also get rid of most of the assertions by adding a ShouldNotReachHere() in the default clause.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. I proposed to use the switch when JDK-8321371 was being reviewed: #16994 (comment)

Frederick (@fbredber) wanted to avoid branches: #16994 (comment)

Copy link
Member Author

@eastig eastig Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch-based version is committed: e984fde

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so this inline assembly was to optimize SpinPause, since it sits in hot loop. Have you looked at disassembly for SpinPause before/after? On my M1, I see this:

% lldb -o "disassemble -n SpinPause" -o quit -- build/macosx-aarch64-server-release/images/jdk/lib/server/libjvm.dylib

# Before
libjvm.dylib`::SpinPause():
libjvm.dylib[0x89f0d4] <+0>:  stp    x29, x30, [sp, #-0x10]!
libjvm.dylib[0x89f0d8] <+4>:  mov    x29, sp
libjvm.dylib[0x89f0dc] <+8>:  adrp   x8, 1409
libjvm.dylib[0x89f0e0] <+12>: add    x8, x8, #0x80 ; VM_Version::_spin_wait
libjvm.dylib[0x89f0e4] <+16>: ldrsw  x8, [x8]
libjvm.dylib[0x89f0e8] <+20>: lsl    x8, x8, #3
libjvm.dylib[0x89f0ec] <+24>: adr    x9, 0x89f100 ; <+44> at os_bsd_aarch64.cpp:545:5
libjvm.dylib[0x89f0f0] <+28>: add    x9, x9, x8
libjvm.dylib[0x89f0f4] <+32>: br     x9
libjvm.dylib[0x89f0f8] <+36>: b      0x89f114       ; <+64> at os_bsd_aarch64.cpp:561:5
libjvm.dylib[0x89f0fc] <+40>: nop    
libjvm.dylib[0x89f100] <+44>: nop    
libjvm.dylib[0x89f104] <+48>: b      0x89f114       ; <+64> at os_bsd_aarch64.cpp:561:5
libjvm.dylib[0x89f108] <+52>: isb    
libjvm.dylib[0x89f10c] <+56>: b      0x89f114       ; <+64> at os_bsd_aarch64.cpp:561:5
libjvm.dylib[0x89f110] <+60>: yield  
libjvm.dylib[0x89f114] <+64>: mov    w0, #0x1 ; =1 
libjvm.dylib[0x89f118] <+68>: ldp    x29, x30, [sp], #0x10
libjvm.dylib[0x89f11c] <+72>: ret  

# After
libjvm.dylib`::SpinPause():
libjvm.dylib[0x89f074] <+0>:   stp    x29, x30, [sp, #-0x10]!
libjvm.dylib[0x89f078] <+4>:   mov    x29, sp
libjvm.dylib[0x89f07c] <+8>:   adrp   x8, 1409
libjvm.dylib[0x89f080] <+12>:  add    x8, x8, #0x80 ; VM_Version::_spin_wait
libjvm.dylib[0x89f084] <+16>:  ldr    w8, [x8]
libjvm.dylib[0x89f088] <+20>:  add    w8, w8, #0x1
libjvm.dylib[0x89f08c] <+24>:  cmp    w8, #0x4
libjvm.dylib[0x89f090] <+28>:  b.hi   0x89f0ec       ; <+120> at os_bsd_aarch64.cpp:551:7
libjvm.dylib[0x89f094] <+32>:  adrp   x9, 0
libjvm.dylib[0x89f098] <+36>:  add    x9, x9, #0xfc ; ___lldb_unnamed_symbol66913
libjvm.dylib[0x89f09c] <+40>:  adr    x10, 0x89f09c ; <+40> at os_bsd_aarch64.cpp
libjvm.dylib[0x89f0a0] <+44>:  ldrsw  x11, [x9, x8, lsl #2]
libjvm.dylib[0x89f0a4] <+48>:  add    x10, x10, x11
libjvm.dylib[0x89f0a8] <+52>:  br     x10
libjvm.dylib[0x89f0ac] <+56>:  nop    
libjvm.dylib[0x89f0b0] <+60>:  mov    w0, #0x1 ; =1 
libjvm.dylib[0x89f0b4] <+64>:  ldp    x29, x30, [sp], #0x10
libjvm.dylib[0x89f0b8] <+68>:  ret    
libjvm.dylib[0x89f0bc] <+72>:  isb    
libjvm.dylib[0x89f0c0] <+76>:  mov    w0, #0x1 ; =1 
libjvm.dylib[0x89f0c4] <+80>:  ldp    x29, x30, [sp], #0x10
libjvm.dylib[0x89f0c8] <+84>:  ret    
libjvm.dylib[0x89f0cc] <+88>:  yield  
libjvm.dylib[0x89f0d0] <+92>:  mov    w0, #0x1 ; =1 
libjvm.dylib[0x89f0d4] <+96>:  ldp    x29, x30, [sp], #0x10
libjvm.dylib[0x89f0d8] <+100>: ret    
libjvm.dylib[0x89f0dc] <+104>: sb     
libjvm.dylib[0x89f0e0] <+108>: mov    w0, #0x1 ; =1 
libjvm.dylib[0x89f0e4] <+112>: ldp    x29, x30, [sp], #0x10
libjvm.dylib[0x89f0e8] <+116>: ret    
libjvm.dylib[0x89f0ec] <+120>: adrp   x0, 1063
libjvm.dylib[0x89f0f0] <+124>: add    x0, x0, #0xe2a ; "src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp"
libjvm.dylib[0x89f0f4] <+128>: mov    w1, #0x227 ; =551 
libjvm.dylib[0x89f0f8] <+132>: bl     0x311f84       ; report_should_not_reach_here at debug.cpp:247

So I think switch is fairly well compiled. On first glance, it generates more code by duplicating the epilog for every case, but I think that is a bit cleaner than trying to do branch-overs. It generates marginally better code if you place case in enum order, and do should_not_reach here branch only for debug builds:

#ifdef ASSERT
    default:
      ShouldNotReachHere();
#endif

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. When writing inline asm, it doesn't much help to try to out-guess the compiler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my hand written assembly with the current default YIELD: https://godbolt.org/z/3YjxzW4sW

SpinPause(SpinWait::Inst):
        mov     w8, w0
        tbz     w8, #0, .Ltmp0
        yield
        b       .Ltmp1
.Ltmp0:
.Ltmp1:

        ret
Iterations:        100
Instructions:      500
Total Cycles:      152

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original assembly with the current default YIELD: https://godbolt.org/z/McWGM4f44

SpinPause(SpinWait::Inst):
        lsl     w8, w0, #3
        sxtw    x8, w8
        adr     x9, #20
        add     x9, x9, x8
        br      x9
        b       .Ltmp1
        nop
        nop
        b       .Ltmp1
        nop
        b       .Ltmp1
        yield
.Ltmp1:

        ret
Iterations:        100
Instructions:      1300
Total Cycles:      254

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the default YIELD we have:

  • Original: ~250 clocks
  • Compiler binary search tree: ~200 clocks (-20%)
  • Optimized with TBZ: ~150 clocks (-40%)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the differences might be smaller on the real hardware. Maybe everything will be around 200 clocks.
If we need code easy to maintain then this is switch. If we need performance then this is tbz.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I did quite a lot of performance measurements before I settled on the assembler solution. Have you made any comparison before and after changing from the assembler code to the new c++ code? If so what tests did you run? Since the code is called in tight locking loops, this code really matters.

Given that this routine is a backoff delay, it's not clear that speeding it up helps.

However, if we really wanted to speed this up we'd use an indirect branch to one of four code blocks: isb; ret would be one of them. But I don't think we do want to speed it up.

@theRealAph
Copy link
Contributor

Do you not think we should implement the Arm-recommended solution in https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm?

@eastig
Copy link
Member Author

eastig commented Jul 21, 2025

Do you not think we should implement the Arm-recommended solution in https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm?

When we get armv8.7 hardware, I'd like to try their solution.
ISB-based solution is looked tuned for Graviton 2/3/4. I see it is implemented to support delays requested in ns. Do we request ns delays in Hotspot?

@theRealAph
Copy link
Contributor

theRealAph commented Jul 21, 2025

Do you not think we should implement the Arm-recommended solution in https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm?

When we get armv8.7 hardware, I'd like to try their solution.

OK, that makes sense.

ISB-based solution is looked tuned for Graviton 2/3/4. I see it is implemented to support delays requested in ns. Do we request ns delays in Hotspot?

I don't think that matters. We should experiment when we have hardware to find a sensible default.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right then, looks good to me.

/reviewers 2

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 22, 2025
@openjdk
Copy link

openjdk bot commented Jul 22, 2025

@shipilev
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jul 22, 2025
theRealAph
theRealAph approved these changes Jul 23, 2025
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot [email protected] ready Pull request is ready to be integrated rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

4 participants