spin lock #605

thompsonbry · 2024-11-22T17:07:01Z

You mentioned an issue with the use of _mm_pause in the spin lock implementation here. We've been using a variant of this spin lock:

    // spin_lock code is taken from: https://rigtorp.se/spinlock/
    // modified to handle ARM
    #include <atomic>

    // cpu_acquiesce() is our platform-specific spin-wait hint; it is defined
    // elsewhere and not shown in this comment.
    struct spin_lock_t {
      std::atomic<bool> lock_ = {0};

      void lock() noexcept {
        for (;;) {
          // Optimistically assume the lock is free on the first try
          if (!lock_.exchange(true, std::memory_order_acquire)) {
            return;
          }
          // Wait for lock to be released without generating cache misses
          while (lock_.load(std::memory_order_relaxed)) {
            cpu_acquiesce();
          }
        }
      }

      bool try_lock() noexcept {
        // First do a relaxed load to check if lock is free in order to prevent
        // unnecessary cache misses if someone does while(!try_lock())
        return !lock_.load(std::memory_order_relaxed) &&
               !lock_.exchange(true, std::memory_order_acquire);
      }

      void unlock() noexcept {
        lock_.store(false, std::memory_order_release);
      }
    }; // spin_lock_t
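The snippet above calls `cpu_acquiesce()` without showing it. The following is only a guess at what such a helper might look like, not the code actually in use: it assumes the x86 `_mm_pause` intrinsic on x86 builds, the `yield` hint on AArch64 with GCC/Clang, and a plain compiler barrier everywhere else.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical spin-wait hint. The name matches the call in the spin lock
// above, but this body is an assumption, not the project's implementation.
inline void cpu_acquiesce() noexcept {
#if defined(__x86_64__) || defined(_M_X64)
  _mm_pause();  // emits the x86 PAUSE instruction
#elif defined(__aarch64__)
  __asm__ __volatile__("yield" ::: "memory");  // AArch64 spin-wait hint
#else
  // No known CPU hint for this target: only stop the compiler from
  // reordering memory accesses across this point.
  std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}
```

None of the branches makes a system call, so the calling thread stays on its core in every case.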

laurynas-biveinis · 2024-11-25T15:47:12Z

Will you be able to get by with a pure spinlock for locking? Without putting the thread to sleep after some spinning?

thompsonbry · 2024-11-25T15:55:54Z

Likely. All structural mutation operations should be very fast. And we can allocate before we lock. Allocation is much of the time unless we have a lightweight pool. We generally do something more like ROWEX and I can help work through what that might look like as we get further along, though it offered no benefit in their testing. Templating the lock strategy and the key structure would both seem to increase flexibility. Bryan


laurynas-biveinis · 2024-12-07T14:01:15Z

The referenced spinlock implementation does not carry over directly to this setting, and the current implementation already addresses the points made there. The read lock operation is a pure read. The write lock is a single CAS attempt after a read, which, if it fails, does not retry. So in a way this spinlock is already TTAS, although the correspondence is not 1:1. The memory barriers are as weak as possible too.

What needs improving, however, is the spin loop: it calls _mm_pause on every iteration.

thompsonbry · 2024-12-07T14:37:34Z

Copying Umit. What is the motivation behind _mm_pause()? If I recall what you had said before, it puts the thread to sleep to allow context switching. But why would we be spinning long enough for that to make sense?

I have not looked at the details of the write lock path. Is the allocation happening before or after the lock is taken?

The bw-tree originally (I think it was the first paper on this, but maybe there was other work) let the different threads simply race until the CAS to install the new node version. So both threads would allocate and both would gather up the data and produce a consolidated new node. Without question the loser did work that had to be thrown away, and the code reflected this with status returns which included codes indicating that operations should be retried, and with loops to race to outcomes.

Taking a lock ensures that only one thread does the work. So it should be more efficient in cases where a race is likely to occur. But why park a thread vs. reduce the time with the lock held (allocate before lock) and let the thread spin? The ART nodes are all very small and consolidation time while holding the lock should be small as well.

Just trying to figure out the rationale here. Thanks, Bryan


thompsonbry · 2024-12-07T14:49:13Z

_mm_pause won't yield the thread! The thread stays on the CPU and is just paused for ~30 clocks (if memory serves). It doesn't context switch. Without it, there could be excessive cache line contention and hence unnecessary memory transfers. These really are spin locks. If we allowed a context switch immediately, it would be much more expensive. Umit


thompsonbry · 2024-12-07T14:52:59Z

Ok. What is the problem that we are trying to solve then?

thompsonbry · 2024-12-07T14:53:08Z

See https://rigtorp.se/spinlock/ Umit


laurynas-biveinis · 2024-12-07T15:01:18Z

_mm_pause does not put the thread to sleep as far as the OS scheduler is concerned; it compiles to the PAUSE CPU instruction, which apparently lets the CPU save memory traffic and power.

Re. write lock path, which allocation are you referring to? New tree node allocation? If so, that happens before the lock. The allocated nodes are cached thread-locally in case of restarts to avoid repeated allocs-deallocs.

Re. parking threads, with OLC ART it only happens in the read lock path.

I'd imagine the design as _mm_pause for ~5 iterations (maybe even 1-2 plain busy-wait iterations first), then ~10 iterations with a random short sleep, then repeat.
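To make the proposal above concrete, here is a rough sketch of such a staged wait, under stated assumptions. The class name `staged_backoff`, the helper `pause_hint`, and the 50-microsecond sleep cap are all made up for illustration; the iteration counts (5 and 10) are the approximate values from the comment, and the optional initial plain busy-wait iterations are left out for brevity.

```cpp
#include <atomic>
#include <chrono>
#include <random>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical CPU hint; a no-op on targets other than x86-64.
inline void pause_hint() noexcept {
#if defined(__x86_64__) || defined(_M_X64)
  _mm_pause();
#endif
}

// Hypothetical staged backoff: kPauseIters PAUSE hints, then kSleepIters
// short random sleeps, then the cycle starts over.
class staged_backoff {
 public:
  static constexpr unsigned kPauseIters = 5;
  static constexpr unsigned kSleepIters = 10;
  static constexpr int kMaxSleepMicros = 50;  // assumed cap, not from the thread

  void wait() {
    if (step_ < kPauseIters) {
      pause_hint();
    } else {
      // Randomized duration so that waiters do not wake in lockstep.
      std::uniform_int_distribution<int> dist(1, kMaxSleepMicros);
      std::this_thread::sleep_for(std::chrono::microseconds(dist(rng_)));
    }
    step_ = (step_ + 1) % (kPauseIters + kSleepIters);
  }

  unsigned step() const noexcept { return step_; }

 private:
  unsigned step_ = 0;
  std::minstd_rand rng_{std::random_device{}()};
};

// Example use: wait until a flag that some other thread clears becomes false.
inline void wait_until_unlocked(const std::atomic<bool>& locked) {
  staged_backoff backoff;
  while (locked.load(std::memory_order_acquire)) {
    backoff.wait();
  }
}
```

Unlike PAUSE, the sleep stage does go through the OS scheduler, so the right iteration counts and sleep cap would have to come from benchmarking rather than from this sketch.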

thompsonbry · 2024-12-07T15:23:41Z

> Re. parking threads, with OLC ART it only happens in the read lock path.

Do you mean "write lock path"?


thompsonbry · 2024-12-07T15:24:12Z

> Re. write lock path, what allocation you are referring to? New tree node allocation? If so, that happens before the lock. The allocated nodes are cached thread-locally in case of restarts to avoid repeated allocs-deallocs.

Yes, that is what I meant. Allocation before you take the write lock.


laurynas-biveinis · 2024-12-07T15:32:57Z

All algorithms look like this:

reads: read lock (spinning if needed), do things, see if the read lock is still valid, if not, restart.

writes: read lock (spinning if needed), do things, try to upgrade the read lock to a write lock with a single CAS, if that fails, restart.

So both readers and writers could park threads if the spinlock is replaced with a lock.
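The read and write protocols described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual lock: the class name `optimistic_lock`, its method names, and the bit layout (bit 0 as the write-lock bit, the remaining bits as a version counter) are all hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// Hypothetical optimistic version lock.
// Assumed layout: bit 0 = write-locked, upper bits = version counter.
class optimistic_lock {
 public:
  using version_t = std::uint64_t;

  // "Read lock": a pure read. Returns the current version, or nullopt if a
  // writer holds the lock, in which case the caller spins and tries again.
  std::optional<version_t> try_read_lock() const noexcept {
    const version_t v = word_.load(std::memory_order_acquire);
    if (v & kLockedBit) return std::nullopt;
    return v;
  }

  // Validation after reading: true if no writer has touched the word since
  // version v was obtained. On false the caller restarts the operation.
  bool check(version_t v) const noexcept {
    std::atomic_thread_fence(std::memory_order_acquire);
    return word_.load(std::memory_order_relaxed) == v;
  }

  // Upgrade from read to write lock with a single CAS and no retry.
  // On failure the caller restarts the whole operation.
  bool try_upgrade_to_write_lock(version_t v) noexcept {
    return word_.compare_exchange_strong(v, v | kLockedBit,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }

  // Release the write lock. Adding 1 to a word with bit 0 set clears the
  // lock bit and carries into the version, so old versions stop validating.
  void write_unlock() noexcept {
    word_.fetch_add(1, std::memory_order_release);
  }

 private:
  static constexpr version_t kLockedBit = 1;
  std::atomic<version_t> word_{0};
};
```

A reader would call `try_read_lock()`, read the node, then call `check()` and restart if it returns false. A writer would do the same reads and then call `try_upgrade_to_write_lock()` with the version it saw, restarting on failure. In real code the node fields read between the two calls would also need to be accessed in a race-free way (for example through relaxed atomics).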

thompsonbry · 2024-12-07T16:09:06Z

I keep forgetting that this is not what they call ROWEX. Ok. So the reader needs to read the version tag when the write lock is not held. It can spin if the write lock is held. Otherwise it gets the version tag. This is done in a load acquire I assume against the header of the ART node? The write lock sets a bit in that 64-bit word in the ART node header which is the spin lock. So it has exclusive access. Readers and would-be writers now spin. The reader checks the post condition to make sure the version tag has not been modified asynchronously and restarts if it has been modified. A write/write conflict is mediated by the exclusive lock. Is that correct?


laurynas-biveinis · 2024-12-07T16:27:57Z

All correct

thompsonbry · 2024-12-13T15:38:14Z

I am going to close this one for now. There is now a build time option to suppress the use of _mm_pause(). There might be something else to be done here, but it probably warrants its own issue.

thompsonbry closed this as completed Dec 13, 2024
