linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Nysal Jan K.A.	734ad0af36	powerpc/qspinlock: Fix deadlock in MCS queue If an interrupt occurs in queued_spin_lock_slowpath() after we increment qnodesp->count and before node->lock is initialized, another CPU might see stale lock values in get_tail_qnode(). If the stale lock value happens to match the lock on that CPU, then we write to the "next" pointer of the wrong qnode. This causes a deadlock as the former CPU, once it becomes the head of the MCS queue, will spin indefinitely until it's "next" pointer is set by its successor in the queue. Running stress-ng on a 16 core (16EC/16VP) shared LPAR, results in occasional lockups similar to the following: $ stress-ng --all 128 --vm-bytes 80% --aggressive \ --maximize --oomable --verify --syslog \ --metrics --times --timeout 5m watchdog: CPU 15 Hard LOCKUP ...... NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490 LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90 Call Trace: 0xc000002cfffa3bf0 (unreliable) _raw_spin_lock+0x6c/0x90 raw_spin_rq_lock_nested.part.135+0x4c/0xd0 sched_ttwu_pending+0x60/0x1f0 __flush_smp_call_function_queue+0x1dc/0x670 smp_ipi_demux_relaxed+0xa4/0x100 xive_muxed_ipi_action+0x20/0x40 __handle_irq_event_percpu+0x80/0x240 handle_irq_event_percpu+0x2c/0x80 handle_percpu_irq+0x84/0xd0 generic_handle_irq+0x54/0x80 __do_irq+0xac/0x210 __do_IRQ+0x74/0xd0 0x0 do_IRQ+0x8c/0x170 hardware_interrupt_common_virt+0x29c/0x2a0 --- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490 ...... NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490 LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90 --- interrupt: 500 0xc0000029c1a41d00 (unreliable) _raw_spin_lock+0x6c/0x90 futex_wake+0x100/0x260 do_futex+0x21c/0x2a0 sys_futex+0x98/0x270 system_call_exception+0x14c/0x2f0 system_call_vectored_common+0x15c/0x2ec The following code flow illustrates how the deadlock occurs. For the sake of brevity, assume that both locks (A and B) are contended and we call the queued_spin_lock_slowpath() function. CPU0 CPU1 ---- ---- spin_lock_irqsave(A) \| spin_unlock_irqrestore(A) \| spin_lock(B) \| \| \| ▼ \| id = qnodesp->count++; \| (Note that nodes[0].lock == A) \| \| \| ▼ \| Interrupt \| (happens before "nodes[0].lock = B") \| \| \| ▼ \| spin_lock_irqsave(A) \| \| \| ▼ \| id = qnodesp->count++ \| nodes[1].lock = A \| \| \| ▼ \| Tail of MCS queue \| \| spin_lock_irqsave(A) ▼ \| Head of MCS queue ▼ \| CPU0 is previous tail ▼ \| Spin indefinitely ▼ (until "nodes[1].next != NULL") prev = get_tail_qnode(A, CPU0) \| ▼ prev == &qnodes[CPU0].nodes[0] (as qnodes[CPU0].nodes[0].lock == A) \| ▼ WRITE_ONCE(prev->next, node) \| ▼ Spin indefinitely (until nodes[0].locked == 1) Thanks to Saket Kumar Bhaskar for help with recreating the issue Fixes: `84990b1695` ("powerpc/qspinlock: add mcs queueing for contended waiters") Cc: stable@vger.kernel.org # v6.2+ Reported-by: Geetika Moolchandani <geetika@linux.ibm.com> Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Reported-by: Jijo Varghese <vargjijo@in.ibm.com> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240829022830.1164355-1-nysal@linux.ibm.com	2024-08-29 15:12:51 +10:00
Nicholas Piggin	b629b54170	powerpc/qspinlock: Rename yield_propagate_owner tunable Rename yield_propagate_owner to yield_sleepy_owner, which better describes what it does (what, not how). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-7-npiggin@gmail.com	2023-10-20 22:43:34 +11:00
Nicholas Piggin	1e6d5f7257	powerpc/qspinlock: Propagate sleepy if previous waiter is preempted The sleepy (aka lock-owner-is-preempted) condition is propagated down the queue by each waiter. If a waiter becomes preempted, it can no longer propagate sleepy. To allow subsequent waiters to yield to the lock owner, also check the lock owner in this case. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-6-npiggin@gmail.com	2023-10-20 22:43:34 +11:00
Nicholas Piggin	fcf77d4427	powerpc/qspinlock: don't propagate the not-sleepy state To simplify things, don't propagate the not-sleepy condition back down the queue. Instead, have the waiters clear their own node->sleepy when finding the lock owner is not preempted. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-5-npiggin@gmail.com	2023-10-20 22:43:34 +11:00
Nicholas Piggin	fd8fae50c9	powerpc/qspinlock: propagate owner preemptedness rather than CPU number Rather than propagating the CPU number of the preempted lock owner, just propagate whether the owner was preempted. Waiters must read the lock value when yielding to it to prevent races anyway, so might as well always load the owner CPU from the lock. To further simplify the code, also don't propagate the -1 (or sleepy=false in the new scheme) down the queue. Instead, have the waiters clear it themselves when finding the lock owner is not preempted. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-4-npiggin@gmail.com	2023-10-20 22:43:34 +11:00
Nicholas Piggin	f656864738	powerpc/qspinlock: stop queued waiters trying to set lock sleepy If a queued waiter notices the lock owner or the previous waiter has been preempted, it attempts to mark the lock sleepy, but it does this as a try-set operation using the original lock value it got when queueing, which will become stale as the queue progresses, and the try-set will fail. Drop this and just set the sleepy seen clock. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-3-npiggin@gmail.com	2023-10-20 22:43:34 +11:00
Nicholas Piggin	f9bc9bbe8a	powerpc/qspinlock: Fix stale propagated yield_cpu yield_cpu is a sample of a preempted lock holder that gets propagated back through the queue. Queued waiters use this to yield to the preempted lock holder without continually sampling the lock word (which would defeat the purpose of MCS queueing by bouncing the cache line). The problem is that yield_cpu can become stale. It can take some time to be passed down the chain, and if any queued waiter gets preempted then it will cease to propagate the yield_cpu to later waiters. This can result in yielding to a CPU that no longer holds the lock, which is bad, but particularly if it is currently in H_CEDE (idle), then it appears to be preempted and some hypervisors (PowerVM) can cause very long H_CONFER latencies waiting for H_CEDE wakeup. This results in latency spikes and hard lockups on oversubscribed partitions with lock contention. This is a minimal fix. Before yielding to yield_cpu, sample the lock word to confirm yield_cpu is still the owner, and bail out of it is not. Thanks to a bunch of people who reported this and tracked down the exact problem using tracepoints and dispatch trace logs. Fixes: `28db61e207` ("powerpc/qspinlock: allow propagation of yield CPU down the queue") Cc: stable@vger.kernel.org # v6.2+ Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Laurent Dufour <ldufour@linux.ibm.com> Reported-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Debugged-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-2-npiggin@gmail.com	2023-10-18 21:07:21 +11:00
Rohan McLure	6f3136326e	powerpc: qspinlock: Enforce qnode writes prior to publishing to queue Annotate the release barrier and memory clobber (in effect, producing a compiler barrier) in the publish_tail_cpu call. These barriers have the effect of ensuring that qnode attributes are all written to prior to publish the node to the waitqueue. Even while the initial write to the 'locked' attribute is guaranteed to terminate prior to the node being visible, KCSAN still complains that the write is reorderable by the compiler. Issue a kcsan_release() to inform KCSAN of the release barrier contained in publish_tail_cpu(). Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230510033117.1395895-3-rmclure@linux.ibm.com	2023-06-21 15:13:57 +10:00
Rohan McLure	03d44ee80e	powerpc: qspinlock: Mark accesses to qnode lock checks The powerpc implementation of qspinlocks will both poll and spin on the bitlock guarding a qnode. Mark these accesses with READ_ONCE to convey to KCSAN that polling is intentional here. Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230510033117.1395895-2-rmclure@linux.ibm.com	2023-06-21 15:13:57 +10:00
Nicholas Piggin	13959373e9	powerpc/qspinlock: Fix 32-bit build Some 32-bit configurations don't pull in the spin_begin/end/relax definitions. Fix is to restore a lost include. Reported-by: kernel test robot <lkp@intel.com> Fixes: `84990b1695` ("powerpc/qspinlock: add mcs queueing for contended waiters") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/oe-kbuild-all/202212050224.i7uh9fOh-lkp@intel.com Link: https://lore.kernel.org/r/20221208123225.1566113-1-npiggin@gmail.com	2022-12-12 12:34:52 +11:00
Nicholas Piggin	0b2199841a	powerpc/qspinlock: add compile-time tuning adjustments This adds compile-time options that allow the EH lock hint bit to be enabled or disabled, and adds some new options that may or may not help matters. To help with experimentation and tuning. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-18-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	12b459a5eb	powerpc/qspinlock: provide accounting and options for sleepy locks Finding the owner or a queued waiter on a lock with a preempted vcpu is indicative of an oversubscribed guest causing the lock to get into trouble. Provide some options to detect this situation and have new CPUs avoid queueing for a longer time (more steal iterations) to minimise the problems caused by vcpu preemption on the queue. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-17-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	39dfc73596	powerpc/qspinlock: allow indefinite spinning on a preempted owner Provide an option that holds off queueing indefinitely while the lock owner is preempted. This could reduce queueing latencies for very overcommitted vcpu situations. This is disabled by default. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-16-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	cc79701114	powerpc/qspinlock: reduce remote node steal spins Allow for a reduction in the number of times a CPU from a different node than the owner can attempt to steal the lock before queueing. This could bias the transfer behaviour of the lock across the machine and reduce NUMA crossings. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-15-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	71c235027c	powerpc/qspinlock: use spin_begin/end API Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps to prevent threads issuing a lot of expensive priority nops which may not have much effect due to immediately executing low then medium priority. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-14-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	f61ab43cc1	powerpc/qspinlock: allow lock stealing in trylock and lock fastpath This change allows trylock to steal the lock. It also allows the initial lock attempt to steal the lock rather than bailing out and going to the slow path. This gives trylock more strength: without this a continually-contended lock will never permit a trylock to succeed. With this change, the trylock has a small but non-zero chance. It also gives the lock fastpath most of the benefit of passing the reservation back through to the steal loop in the slow path without the complexity. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-13-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	be742c573f	powerpc/qspinlock: add ability to prod new queue head CPU After the head of the queue acquires the lock, it releases the next waiter in the queue to become the new head. Add an option to prod the new head if its vCPU was preempted. This may only have an effect if queue waiters are yielding. Disable this option by default for now, i.e., no logical change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-12-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	28db61e207	powerpc/qspinlock: allow propagation of yield CPU down the queue Having all CPUs poll the lock word for the owner CPU that should be yielded to defeats most of the purpose of using MCS queueing for scalability. Yet it may be desirable for queued waiters to yield to a preempted owner. With this change, queue waiters never sample the owner CPU directly from the lock word. The queue head (which is spinning on the lock) propagates the owner CPU back to the next waiter if it finds the owner has been preempted. That waiter then propagates the owner CPU back to the next waiter, and so on. s390 addresses this problem differenty, by having queued waiters sample the lock word to find the owner at a low frequency. That has the advantage of being simpler, the advantage of propagation is that the lock word never has to be accesed by queued waiters, and the transfer of cache lines to transmit the owner data is only required when lock holder vCPU preemption occurs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-11-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	b4c3cdc1a6	powerpc/qspinlock: allow stealing when head of queue yields If the head of queue is preventing stealing but it finds the owner vCPU is preempted, it will yield its cycles to the owner which could cause it to become preempted. Add an option to re-allow stealers before yielding, and disallow them again after returning from the yield. Disable this option by default for now, i.e., no logical change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-10-npiggin@gmail.com	2022-12-02 17:48:50 +11:00
Nicholas Piggin	bd48287b2c	powerpc/qspinlock: implement option to yield to previous node Queued waiters which are not at the head of the queue don't spin on the lock word but their qnode lock word, waiting for the previous queued CPU to release them. Add an option which allows these waiters to yield to the previous CPU if its vCPU is preempted. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-9-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	085f03311b	powerpc/qspinlock: paravirt yield to lock owner Waiters spinning on the lock word should yield to the lock owner if the vCPU is preempted. This improves performance when the hypervisor has oversubscribed physical CPUs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-8-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	e1a31e7fd7	powerpc/qspinlock: store owner CPU in lock word Store the owner CPU number in the lock word so it may be yielded to, as powerpc's paravirtualised simple spinlocks do. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-7-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	0944534ef4	powerpc/qspinlock: theft prevention to control latency Give the queue head the ability to stop stealers. After a number of spins without successfully acquiring the lock, the queue head sets this, which halts stealing and will assure it is the next owner. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-6-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	6aa42f883c	powerpc/qspinlock: allow new waiters to steal the lock before queueing Allow new waiters to "steal" the lock before queueing. That is, to acquire it while other CPUs have queued. This particularly helps paravirt performance when physical CPUs are oversubscribed, by keeping the lock from becoming a strict FIFO and vCPU preemption causing queue train wrecks. The new __queued_spin_trylock_steal() function is put in qspinlock.h to save having to move it, because it will be used there by a later change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-5-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	b3a73b7db2	powerpc/qspinlock: convert atomic operations to assembly This uses more optimal ll/sc style access patterns (rather than cmpxchg), and also sets the EH=1 lock hint on those operations which acquire ownership of the lock. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-4-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	84990b1695	powerpc/qspinlock: add mcs queueing for contended waiters This forms the basis of the qspinlock slow path. Like generic qspinlocks and unlike the vanilla MCS algorithm, the lock owner does not participate in the queue, only waiters. The first waiter spins on the lock word, then when the lock is released it takes ownership and unqueues the next waiter. This is how qspinlocks can be implemented with the spinlock API -- lock owners don't need a node, only waiters do. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221126095932.1234527-2-npiggin@gmail.com	2022-12-02 17:48:49 +11:00
Nicholas Piggin	9f61521c7a	powerpc/qspinlock: powerpc qspinlock implementation Add a powerpc specific implementation of queued spinlocks. This is the build framework with a very simple (non-queued) spinlock implementation to begin with. Later changes add queueing, and other features and optimisations one-at-a-time. It is done this way to more easily see how the queued spinlocks are built, and to make performance and correctness bisects more useful. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Drop paravirt.h & processor.h changes to fix 32-bit build] [mpe: Fix 32-bit build of qspinlock.o & disallow GENERIC_LOCKBREAK per Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/CONLLQB6DCJU.2ZPOS7T6S5GRR@bobo	2022-12-02 17:48:02 +11:00

27 commits