rqspinlock/freeze #10360

kkdwvd · 2025-11-25T01:56:12Z

No description provided.

Copilot

Pull request overview

This PR modifies the resilient spinlock implementation to reorder critical operations for better handling of NMI reentrancy. The key change is moving grab_held_lock_entry before the lock acquisition attempts to ensure that deadlock detection entries are established before any lock can be acquired, preventing race conditions with reentrant interrupts.

Key Changes:

- Reordered grab_held_lock_entry to occur before try_cmpxchg_acquire in res_spin_lock to handle NMI reentrancy correctly
- Reordered unlock operations in res_spin_unlock to perform the lock release before clearing the held lock entry, addressing NMI AA heuristic detection
- Removed redundant grab_held_lock_entry calls from slow path functions since the entry is now held by the caller

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
include/asm-generic/rqspinlock.h	Reordered lock/unlock operations and updated comments to reflect new ordering for NMI reentrancy handling
kernel/bpf/rqspinlock.c	Removed `grab_held_lock_entry` calls from slow path functions with explanatory comments

kernel/bpf/rqspinlock.c

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Ritesh reported that timeouts occurred frequently for rqspinlock despite reentrancy on the same lock on the same CPU in [0]. This patch closes one of the races leading to this behavior, and reduces the frequency of timeouts.

We currently have a tiny window between the fast-path cmpxchg and the grabbing of the lock entry where an NMI could land, attempt the same lock that was just acquired, and end up timing out. This is not ideal. Instead, move the lock entry acquisition from the fast path to before the cmpxchg, and remove the grabbing of the lock entry in the slow path, assuming it was already taken by the fast path. The TAS fallback is invoked directly without being preceded by the typical fast path, therefore we must continue to grab the deadlock detection entry in that case.

Case on lock leading to missed AA:

    cmpxchg lock A
    <NMI>
    ... rqspinlock acquisition of A
    ... timeout
    </NMI>
    grab_held_lock_entry(A)

There is a similar case when unlocking the lock. If the NMI lands between the WRITE_ONCE and smp_store_release, it is possible that we end up in a situation where the NMI fails to diagnose the AA condition, leading to a timeout.

Case on unlock leading to missed AA:

    WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
    <NMI>
    ... rqspinlock acquisition of A
    ... timeout
    </NMI>
    smp_store_release(A->locked, 0)

The patch changes the order on unlock to smp_store_release() followed by WRITE_ONCE() of NULL. This avoids the missed AA detection described above, but may lead to a false positive if the NMI lands between these two statements, which is acceptable (and preferred over a timeout).

The original intention of the reverse order on unlock was to prevent the following possible misdiagnosis of an ABBA scenario:

    grab entry A
    lock A
    grab entry B
    lock B
    unlock B
       smp_store_release(B->locked, 0)
                                            grab entry B
                                            lock B
                                            grab entry A
                                            lock A
                                            ! <detect ABBA>
       WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)

If the store release were after the WRITE_ONCE, the other CPU would not observe B in the table of the CPU unlocking the lock B. However, since the threads are obviously participating in an ABBA deadlock, it is no longer appealing to use the order above since it may lead to a 250 ms timeout due to missed AA detection.

[0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com

Fixes: 0d80e7f ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
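The two windows described in this commit message can be replayed in a small single-threaded userspace model. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the "NMI" is an ordinary function call placed at the racy point, and a reentrant attempt that would spin until the timeout is reported as `MODEL_WOULD_TIMEOUT` instead of actually waiting. Because there is no second CPU in the model, the false positive mentioned for the new unlock order is not reproduced; the reentrant attempt simply finds the lock free.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_HELD 8	/* hypothetical capacity for this model */

struct model_lock { atomic_int locked; };

/* Stands in for the per-CPU table of held locks. */
static struct {
	int cnt;
	struct model_lock *locks[MODEL_MAX_HELD];
} held;

enum model_result { MODEL_ACQUIRED, MODEL_AA_DETECTED, MODEL_WOULD_TIMEOUT };

static void model_grab_entry(struct model_lock *lock)
{
	held.locks[held.cnt++] = lock;
}

static void model_clear_entry(void)
{
	held.locks[--held.cnt] = NULL;
}

/* AA check: look for the lock in every entry below the topmost one. */
static bool model_check_aa(struct model_lock *lock)
{
	for (int i = 0; i < held.cnt - 1; i++)
		if (held.locks[i] == lock)
			return true;
	return false;
}

/* What a reentrant acquisition on the same CPU (the "NMI") would conclude. */
static enum model_result model_nmi_attempt(struct model_lock *lock)
{
	enum model_result res = MODEL_WOULD_TIMEOUT;
	int expected = 0;

	model_grab_entry(lock);
	if (atomic_compare_exchange_strong(&lock->locked, &expected, 1)) {
		res = MODEL_ACQUIRED;
		atomic_store(&lock->locked, 0);
	} else if (model_check_aa(lock)) {
		res = MODEL_AA_DETECTED;
	}
	model_clear_entry();
	return res;
}

/* Lock side: the NMI lands right after the owner's cmpxchg. */
static enum model_result model_lock_race(struct model_lock *lock, bool entry_first)
{
	enum model_result seen;
	int expected = 0;

	if (entry_first)
		model_grab_entry(lock);
	atomic_compare_exchange_strong(&lock->locked, &expected, 1);
	seen = model_nmi_attempt(lock);
	if (!entry_first)
		model_grab_entry(lock);
	atomic_store(&lock->locked, 0);
	model_clear_entry();
	return seen;
}

/* Unlock side: the NMI lands between the two unlock steps. */
static enum model_result model_unlock_race(struct model_lock *lock, bool release_first)
{
	enum model_result seen;
	int expected = 0;

	model_grab_entry(lock);
	atomic_compare_exchange_strong(&lock->locked, &expected, 1);
	if (release_first)
		atomic_store(&lock->locked, 0);
	else
		model_clear_entry();
	seen = model_nmi_attempt(lock);
	if (release_first)
		model_clear_entry();
	else
		atomic_store(&lock->locked, 0);
	return seen;
}
```

Calling `model_lock_race(&a, false)` and `model_unlock_race(&a, false)` replays the two old orderings, and in both the reentrant attempt finds no entry for the lock it cannot take. With `true`, the entry brackets the whole period in which the lock word is set, so the attempt either sees the entry or sees a free lock.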

Currently, while we enter the check_timeout call immediately due to the way the ts.spin is initialized, we still invoke the AA and ABBA checks in the second invocation, and only initialize the timestamp in the first one. Since consecutive iterations are at least 1 ms apart, this can delay the detection of AA deadlocks by up to a millisecond.

Rework check_timeout() to avoid this. First, call check_deadlock_AA() while initializing the timestamps for the wait period. This also means that we only do it once per waiting period, instead of every invocation. Finally, drop check_deadlock() and call check_deadlock_ABBA() directly.

To save on unnecessary ktime_get_mono_fast_ns() in case of AA deadlock, sample the time only if it returns 0.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
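The reworked control flow can be sketched in userspace C as follows. This is an illustration of the ordering described above, not the kernel function: `struct model_env`, `model_clock()` and the boolean deadlock flags are hypothetical stand-ins for the real checks and for ktime_get_mono_fast_ns(), and the one-millisecond spacing of the ABBA check is taken from the description of the existing loop.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NSEC_PER_MSEC 1000000ULL

/* Hypothetical hooks standing in for the real checks and clock. */
struct model_env {
	bool aa_deadlock;
	bool abba_deadlock;
	uint64_t now_ns;
	int clock_reads;
};

struct model_timeout {
	uint64_t timeout_end;	/* 0 until the waiting period starts */
	uint64_t duration;
	uint64_t cur;		/* time of the last periodic check */
};

static uint64_t model_clock(struct model_env *env)
{
	env->clock_reads++;
	return env->now_ns;
}

static int model_check_timeout(struct model_env *env, struct model_timeout *ts)
{
	uint64_t now;

	if (!ts->timeout_end) {
		/* First call of the waiting period: AA check, then timestamps. */
		if (env->aa_deadlock)
			return -EDEADLK;
		ts->cur = model_clock(env);
		ts->timeout_end = ts->cur + ts->duration;
		return 0;
	}

	now = model_clock(env);
	if (now > ts->timeout_end)
		return -ETIMEDOUT;

	/* More than a millisecond since the last check: look for ABBA. */
	if (ts->cur + MODEL_NSEC_PER_MSEC < now) {
		ts->cur = now;
		if (env->abba_deadlock)
			return -EDEADLK;
	}
	return 0;
}
```

In this sketch an AA deadlock is reported on the very first call and `clock_reads` stays at zero, which is the saving the last paragraph of the commit message refers to.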

In addition to deferring to the trylock fallback in NMIs, only do so when an rqspinlock waiter is queued on the current CPU. This is detected by noticing a non-zero node index. This allows NMI waiters to join the waiter queue if they aren't interrupting an existing rqspinlock waiter, and increases the chances of fairly obtaining the lock, performing deadlock detection as the head, and not being starved while attempting the trylock.

The trylock path in particular is unlikely to succeed under contention, as it relies on the lock word becoming 0, which indicates no contention. This means that the most likely result for NMIs attempting a trylock is a timeout under contention if they don't hit an AA or ABBA case.

The core problem being addressed through the fixed commit was removing the dependency edge between an NMI queue waiter and the queue waiter it is interrupting. Whenever a circular dependency forms, and with no way to break it (as non-head waiters don't poll for deadlocks or timeouts), we would enter into a deadlock. A trylock breaks such an edge either by probing for deadlocks or, finally, by terminating the waiting loop using a timeout.

By excluding queueing on CPUs where the node index is non-zero for NMIs, this sort of dependency is broken. The CPU enters the trylock path for those cases, and falls back to deadlock checks and timeouts. However, in other cases where it doesn't interrupt the CPU in the slow path while it's queued on the lock, it can join the queue as a normal waiter, and avoid trylock associated starvation and subsequent timeouts.

There are a few remaining cases here that matter: the NMI can still preempt the owner in its critical section, and if it queues as a non-head waiter, it can end up impeding the progress of the owner. While this won't deadlock, since the head waiter will eventually signal the NMI waiter to stop (due to a timeout), it can still lead to long timeouts. These gaps will be addressed in subsequent commits.

Note that while the node count detection approach is less conservative than simply deferring NMIs to trylock, it is going to return errors where attempts to lock B in NMI happen while waiters for lock A are in a lower context on the same CPU. However, this only occurs when the lower context is queued in the slow path, and the NMI attempt can proceed without failure in all other cases. To continue to prevent AA deadlocks (or ABBA in a similar NMI interrupting lower context pattern), we'd need a more fleshed out algorithm to unlink NMI waiters after they queue and detect such cases. However, all that complexity isn't appealing yet to reduce the failure rate in the small window inside the slow path.

It is important to note that reentrancy in the slow path can also happen through trace_contention_{begin,end}, but in those cases, unlike an NMI, the forward progress of the head waiter (or the predecessor in general) is not being blocked.

Fixes: 0d80e7f ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <[email protected]>
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
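The change in the queueing decision can be written as two small predicates, mirroring the old and new conditions from the diff under review. The function and enum names are hypothetical, and `MODEL_MAX_NODES` is assumed to match the kernel's `_Q_MAX_NODES` value of 4.

```c
#include <stdbool.h>

#define MODEL_MAX_NODES 4	/* assumed to mirror _Q_MAX_NODES */

enum model_path { MODEL_QUEUE, MODEL_TRYLOCK_FALLBACK };

/* Before: every NMI waiter is sent to the trylock fallback. */
static enum model_path model_old_choice(int idx, bool in_nmi)
{
	if (idx >= MODEL_MAX_NODES || in_nmi)
		return MODEL_TRYLOCK_FALLBACK;
	return MODEL_QUEUE;
}

/*
 * After: an NMI waiter only falls back when it interrupted a waiter
 * that already occupies a queue node on this CPU (idx > 0).
 */
static enum model_path model_new_choice(int idx, bool in_nmi)
{
	if (idx >= MODEL_MAX_NODES || (in_nmi && idx > 0))
		return MODEL_TRYLOCK_FALLBACK;
	return MODEL_QUEUE;
}
```

The only input for which the two predicates differ is an NMI arriving with `idx == 0`, which is the case that can now join the queue as a normal waiter.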

The original trylock fallback was inherited from qspinlock, and then reused for the reentrant NMIs while the slow path is active. However, under contention, it is very unlikely for the trylock to succeed in taking the lock. In addition, a trylock also has no fairness guarantees, and thus is prone to starvation issues under extreme scenarios.

The original qspinlock had no choice in terms of returning an error to the caller; if the node count was breached, it had to fall back to trylock to attempt to take the lock. In case of rqspinlock, we do have the option of returning to the user. Thus, simply attempt the trylock once, and instead of spinning, return an error in case the lock cannot be taken.

This ends up significantly reducing the time spent in the trylock fallback, since we no longer wait for the timeout duration trying to aimlessly acquire the lock when there's a high probability that under contention, it won't be available to us anyway.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
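A single-attempt fallback of this kind could look like the following userspace sketch built on C11 atomics. The names are hypothetical, and the error value is a placeholder: the commit message does not say which error code is returned, so `-EBUSY` here is an assumption made only to have something to return.

```c
#include <errno.h>
#include <stdatomic.h>

struct model_lock { atomic_int val; };

/* Placeholder: the commit message does not name the error code. */
#define MODEL_TRYLOCK_FAILED (-EBUSY)

static int model_trylock_fallback(struct model_lock *lock)
{
	int expected = 0;

	/* One attempt only: it succeeds if the lock word is currently 0. */
	if (atomic_compare_exchange_strong_explicit(&lock->val, &expected, 1,
						    memory_order_acquire,
						    memory_order_relaxed))
		return 0;

	/* No spinning until a timeout; report the failure right away. */
	return MODEL_TRYLOCK_FAILED;
}

static void model_unlock(struct model_lock *lock)
{
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}
```

The design point is that the caller gets the error after one compare-and-exchange instead of after a full timeout period spent retrying.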

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Copilot · 2025-11-28T23:06:58Z

kernel/bpf/rqspinlock.c

+	if (check_deadlock_AA(lock))
+		return -EDEADLK;


The check_deadlock_AA call at the queue entry point may cause issues when combined with the fast path's grab_held_lock_entry. If the fast path fails and we enter the slowpath, grab_held_lock_entry was already called (line 185 in rqspinlock.h). Then if we reach this queue label and call check_deadlock_AA, it will check the lock against held entries including the entry just created for this lock. This means it may incorrectly detect the current lock acquisition as an AA deadlock (finding the lock in position cnt-1, which was just added by grab_held_lock_entry). The check should either skip the topmost entry or be conditioned on whether we're coming from the fast path.

That's how check_deadlock_AA already works.
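For readers following the exchange, an AA check that ignores the topmost entry can be sketched like this. It is a userspace model with hypothetical names and a hypothetical table size, meant only to show why the entry pushed for the current acquisition does not match itself.

```c
#include <errno.h>

#define MODEL_NR_HELD 31	/* hypothetical table capacity */

struct model_held {
	int cnt;
	const void *locks[MODEL_NR_HELD];
};

static int model_check_deadlock_aa(const struct model_held *rqh, const void *lock)
{
	int cnt = rqh->cnt < MODEL_NR_HELD ? rqh->cnt : MODEL_NR_HELD;

	/*
	 * Stop at cnt - 1: the last entry belongs to the acquisition in
	 * progress, so only earlier entries can indicate an AA deadlock.
	 */
	for (int i = 0; i < cnt - 1; i++) {
		if (rqh->locks[i] == lock)
			return -EDEADLK;
	}
	return 0;
}
```

With a table holding only the entry for the lock being acquired, the loop body never runs and the function returns 0; a match is reported only when the same lock also appears in an earlier slot.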

Copilot · 2025-11-28T23:06:58Z

kernel/bpf/rqspinlock.c

+	 * queueing, hence do not spin.
 	 */
-	if (unlikely(idx >= _Q_MAX_NODES || in_nmi())) {
+	if (unlikely(idx >= _Q_MAX_NODES || (in_nmi() && idx > 0))) {


The condition (in_nmi() && idx > 0) now fails the lock acquisition in NMI context when there's already one queued node (idx > 0). However, the comment above (lines 440-447) mentions preventing 'long timeouts where we interrupt the pending waiter', but this check happens after we've already incremented node->count at line 454. This means the count is incremented but then we fail and release it at err_release_node. Consider moving this check before incrementing node->count to avoid unnecessary counter manipulation.

Copilot · 2025-11-28T23:06:58Z

include/asm-generic/rqspinlock.h

+	grab_held_lock_entry(lock);
+
+	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
 		return 0;
-	}
 	return resilient_queued_spin_lock_slowpath(lock, val);


When the fast path fails (line 187) and calls resilient_queued_spin_lock_slowpath, the held lock entry has already been created via grab_held_lock_entry at line 185. However, if the slowpath fails and returns an error without acquiring the lock, the caller never gets a chance to call release_held_lock_entry. This will leak the held lock entry, causing the cnt to be permanently incremented and potentially corrupting the held locks tracking. The slowpath should call release_held_lock_entry before returning an error, or the fast path should only call grab_held_lock_entry after successful acquisition.

It already does the release.

While previous commits sufficiently address the deadlocks, there are still scenarios where queueing of waiters in NMIs can exacerbate the possibility of timeouts.

Consider the case below:

    CPU 0
    <NMI>
    res_spin_lock(A) -> becomes non-head waiter
    </NMI>
    lock owner in CS or pending waiter spinning

    CPU 1
    res_spin_lock(A) -> head waiter spinning on owner/pending bits

In such a scenario, the non-head waiter in NMI on CPU 0 will not poll for deadlocks or timeout since it will simply queue behind the previous waiter (head on CPU 1), and also not enter the trylock fallback since no rqspinlock queue waiter is active on CPU 0. In such a scenario, the transaction initiated by the head waiter on CPU 1 will time out, signalling the NMI and ending the cyclic dependency, but it will cost 250 ms of time.

Instead, the NMI on CPU 0 could simply check for the presence of an AA deadlock and only proceed with queueing on success. Add such a check right before any form of queueing is initiated.

The reason the AA deadlock check is not used in conjunction with in_nmi() is that a similar case could occur due to a reentrant path in the owner's critical section, and unconditionally checking for AA before entering the queueing path avoids expensive timeouts. Non-NMI reentrancy only happens at controlled points in the slow path (with specific tracepoints which do not impede the forward progress of a waiter loop), or in the owner CS, while NMIs can land anywhere.

While this check is only needed for non-head waiter queueing, checking whether we are head or not is racy without xchg_tail, and after that point, we are already queued, hence for simplicity we must invoke the check unconditionally.

Note that a more contrived case could still be constructed by using two locks, and interrupting the progress of the respective owners by non-head waiters of the other lock, in an ABBA fashion, which would still not be covered by the current set of checks and conditions. It would still lead to a timeout though, and not a deadlock. An ABBA check cannot happen optimistically before the queueing, since it can be racy, and needs to happen continuously during the waiting period, which would then require an unlinking step for queued NMI/reentrant waiters. This is beyond the scope of this patch.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
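The effect of checking for AA before queueing can be modelled with a per-CPU structure that counts how often the CPU actually joins a wait queue. This is a sketch with hypothetical names: queueing itself is reduced to incrementing a counter, and dropping the entry on the error path is an assumption of the model, not a statement about where the kernel does it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_HELD 8	/* hypothetical capacity */

struct model_cpu {
	const void *held[MODEL_MAX_HELD];	/* held or being acquired */
	int cnt;
	int queued;	/* how often this CPU joined a wait queue */
};

static bool model_already_held(const struct model_cpu *cpu, const void *lock)
{
	/* Skip the topmost entry, which is the current attempt. */
	for (int i = 0; i < cpu->cnt - 1; i++)
		if (cpu->held[i] == lock)
			return true;
	return false;
}

/* Contended acquisition: returns 0 when the caller went on to queue. */
static int model_slowpath_entry(struct model_cpu *cpu, const void *lock)
{
	assert(cpu->cnt < MODEL_MAX_HELD);
	cpu->held[cpu->cnt++] = lock;

	/* Unconditional AA check, not gated on NMI context. */
	if (model_already_held(cpu, lock)) {
		cpu->held[--cpu->cnt] = NULL;
		return -EDEADLK;
	}

	/* From here on the waiter depends on its predecessor in the queue. */
	cpu->queued++;
	return 0;
}
```

A reentrant attempt on a lock the CPU already holds fails with `-EDEADLK` and leaves `queued` untouched, so it never becomes the non-head waiter from the scenario above.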

Add stats to observe the success and failure rate of lock acquisition attempts in various contexts. Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>

kkdwvd force-pushed the rqspinlock/freeze branch from 36b48be to ead4767 Compare November 25, 2025 03:02

kkdwvd requested a review from Copilot November 25, 2025 03:08

Copilot started reviewing on behalf of kkdwvd November 25, 2025 03:13 View session

Copilot finished reviewing on behalf of kkdwvd November 25, 2025 03:17

Copilot AI reviewed Nov 25, 2025

kkdwvd force-pushed the rqspinlock/freeze branch from ead4767 to 98e2d34 Compare November 25, 2025 19:57

kkdwvd requested a review from Copilot November 25, 2025 19:58

Copilot started reviewing on behalf of kkdwvd November 25, 2025 19:59 View session

Copilot finished reviewing on behalf of kkdwvd November 25, 2025 20:02

Copilot AI reviewed Nov 25, 2025

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 4 times, most recently from e4c5f93 to 233a075 Compare November 26, 2025 19:30

kkdwvd added 4 commits November 28, 2025 21:35

kkdwvd force-pushed the rqspinlock/freeze branch from 98e2d34 to 817e36a Compare November 28, 2025 22:41

kkdwvd requested a review from Copilot November 28, 2025 23:03

Copilot started reviewing on behalf of kkdwvd November 28, 2025 23:04 View session

Copilot finished reviewing on behalf of kkdwvd November 28, 2025 23:06

Copilot AI reviewed Nov 28, 2025

kkdwvd added 2 commits November 28, 2025 23:23

selftests/bpf: Add success stats to rqspinlock stress test

91c750a

Add stats to observe the success and failure rate of lock acquisition attempts in various contexts. Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>

kkdwvd force-pushed the rqspinlock/freeze branch from 817e36a to 91c750a Compare November 28, 2025 23:24

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 4 times, most recently from 2751ec7 to 886a6a6 Compare November 29, 2025 17:53
