Large systems can have hundreds of rcu_node structures, and updating
counters in each of them might slow down booting. This commit therefore
updates only the counters in those rcu_node structures corresponding
to the boot CPU, up to and including the root rcu_node structure.
The counters for the remaining rcu_node structures are updated by the
rcu_scheduler_starting() function, which executes just before the first
non-boot kthread is spawned.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that rcu_gp_oldstate can accurately track both normal and
expedited grace periods regardless of system state, rcutorture's
rcu_poll_need_2gp() function need only call for a second grace period
for the old single-unsigned-long grace-period polling APIs
This commit therefore adjusts rcu_poll_need_2gp() accordingly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because both normal and expedited grace periods increment their respective
counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate
structure no longer needs its ->rgos_polled field. This commit therefore
removes this field, shrinking this structure so that it is the same size
as an rcu_head structure.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit causes the early boot single-CPU synchronize_rcu_expedited()
fastpath to update the rcu_state structure's ->expedited_sequence
counter. This will allow the full-state polled grace-period APIs to
detect all expedited grace periods without the need to track the special
combined polling-only counter, which is another step towards removing
the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its
size by one third.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that the expedited grace-period fast path can only happen during
the pre-scheduler portion of early boot, this fast path can no longer
block run-time RCU Trace grace periods. This commit therefore removes
the conditional cond_resched() invocation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit causes the early boot single-CPU synchronize_rcu() fastpath to
update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed
counters. This will allow the full-state polled grace-period APIs to
detect all normal grace periods without the need to track the special
combined polling-only counter, which is a step towards removing the
->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size
by one third.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Now that the grace-period fast path can only happen during the
pre-scheduler portion of early boot, this fast path can no longer block
run-time RCU Tasks and RCU Tasks Trace grace periods. This commit
therefore removes the conditional cond_resched_tasks_rcu_qs() invocation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
It would be good do reduce the size of the rcu_gp_oldstate structure
from three unsigned long instances to two, but this requires that the
boot-time optimized grace periods update the various ->gp_seq fields.
Updating these fields in the rcu_state structure and in all of the
rcu_node structures is at least semi-reasonable, but updating them in
all of the rcu_data structures is a bridge too far. This means that if
there are too many early boot-time grace periods, the ->gp_seq field in
the rcu_data structure cannot be trusted. This commit therefore sets
each rcu_data structure's ->gpwrap field to provide the necessary impetus
for a suitable level of distrust.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The run-time single-CPU grace-period optimization applies only to
kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running
on a single-CPU system. But a kernel intended for a single-CPU system
should instead be built with CONFIG_SMP=n, and in any case, single-CPU
systems running Linux no longer appear to be the common case. Plus this
optimization results in the rcu_gp_oldstate structure being half again
larger than it needs to be.
This commit therefore disables the run-time single-CPU grace-period
optimization, so that this optimization applies only during the
pre-scheduler portion of the boot sequence.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The cond_synchronize_rcu_expedited() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_exp_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The cond_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit removes the blank line preceding the oldstate parameter to
the docbook header for the poll_state_synchronize_rcu() function and
marks uses of this parameter later in that header.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The start_poll_synchronize_rcu_expedited() API compresses the combined
expedited and normal grace-period states into a single unsigned long,
which conserves storage, but can miss grace periods in certain cases
involving overlapping normal and expedited grace periods. Missing the
occasional grace period is usually not a problem, but there are use
cases that care about each and every grace period.
This commit therefore adds yet another member of the
full-state RCU grace-period polling API, which is the
start_poll_synchronize_rcu_expedited_full() function. This uses up to
three times the storage (rcu_gp_oldstate structure instead of unsigned
long), but is guaranteed not to miss grace periods.
[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The start_poll_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the start_poll_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds full-state polling checks to accompany the old-style
polling checks in the rcu_torture_one_read() function. If a polling
cycle within an RCU reader completes, a WARN_ONCE() is triggered.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This check does nothing because the state at this point in the code
because the rcu_torture_writer_state value is guaranteed to instead
be RTWS_REPLACE. This commit therefore removes this check.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a test to rcu_torture_writer() that verifies that a
->get_gp_state_full() and ->poll_gp_state_full() polled grace-period
sequence does not claim that a grace period elapsed within the confines
of the corresponding read-side critical section.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Only vanilla RCU needs a double grace period for its compressed
polled grace-period old-state cookie. This commit therefore adds an
rcu_torture_ops per-flavor function ->poll_need_2gp to allow this check
to be adapted to the RCU flavor under test. A NULL pointer for this
function says that doubled grace periods are never needed.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit abstracts a do_rtws_sync() function that does synchronous
grace-period testing, but also testing the polled API 25% of the time
each for the normal and full-state variants of the polled API.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The get_state_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the get_state_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The get_completed_synchronize_rcu() and poll_state_synchronize_rcu()
APIs compress the combined expedited and normal grace-period states into a
single unsigned long, which conserves storage, but can miss grace periods
in certain cases involving overlapping normal and expedited grace periods.
Missing the occasional grace period is usually not a problem, but there
are use cases that care about each and every grace period.
This commit therefore adds the first members of the full-state RCU
grace-period polling API, namely the get_completed_synchronize_rcu_full()
and poll_state_synchronize_rcu_full() functions. These use up to three
times the storage (rcu_gp_oldstate structure instead of unsigned long),
but which are guaranteed not to miss grace periods, at least in situations
where the single-CPU grace-period optimization does not apply.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Offline CPUs cannot be offloaded or deoffloaded. Any attempt to offload
or deoffload an offline CPU causes a message to be printed on the console,
which is good, but this message does not contain the CPU number, which
is bad. Such a CPU number can be helpful when debugging, as it gives a
clear indication that the CPU in question is in fact offline. This commit
therefore adds the CPU number to the CPU-{,de}offload failure messages.
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The show_rcu_nocb_gp_state() function is supposed to dump out the rcuog
kthread and the show_rcu_nocb_state() function is supposed to dump out
the rcuo[ps] kthread. Currently, both do a mixture, which is not optimal
for debugging, even though it does not affect functionality.
This commit therefore adjusts these two functions to focus on their
respective kthreads.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently the monitor work is scheduled with a fixed interval of HZ/20,
which is roughly 50 milliseconds. The drawback of this approach is
low utilization of the 512 page slots in scenarios with infrequence
kvfree_rcu() calls. For example on an Android system:
<snip>
kworker/3:3-507 [003] .... 470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6
kworker/6:1-76 [006] .... 470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1
kworker/6:1-76 [006] .... 470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9
kworker/3:3-507 [003] .... 471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48
kworker/1:1-73 [001] .... 471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3
kworker/1:1-73 [001] .... 471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76
kworker/0:4-1411 [000] .... 472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1
kworker/0:4-1411 [000] .... 472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5
kworker/6:1-76 [006] .... 472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102
<snip>
In many cases, out of 512 slots, fewer than 10 were actually used.
In order to improve batching and make utilization more efficient this
commit sets a drain interval to a fixed 5-seconds interval. Floods are
detected when a page fills quickly, and in that case, the reclaim work
is re-scheduled for the next scheduling-clock tick (jiffy).
After this change:
<snip>
kworker/7:1-371 [007] .... 5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121
kworker/7:1-371 [007] .... 5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47
kworker/7:1-371 [007] .... 5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510
kworker/7:1-371 [007] .... 5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169
kworker/7:1-371 [007] .... 5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510
kworker/5:6-9428 [005] .... 5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123
kworker/4:7-9434 [004] .... 5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322
kworker/4:7-9434 [004] .... 5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185
kworker/7:1-371 [007] .... 5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510
<snip>
Here, all but one of the cases, more than one hundreds slots were used,
representing an order-of-magnitude improvement.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
As per the comments in include/linux/shrinker.h, .count_objects callback
should return the number of freeable items, but if there are no objects
to free, SHRINK_EMPTY should be returned. The only time 0 is returned
should be when we are unable to determine the number of objects, or the
cache should be skipped for another reason.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The fill_page_cache_func() function allocates couple of pages to store
kvfree_rcu_bulk_data structures. This is a lightweight (GFP_NORETRY)
allocation which can fail under memory pressure. The function will,
however keep retrying even when the previous attempt has failed.
This retrying is in theory correct, but in practice the allocation is
invoked from workqueue context, which means that if the memory reclaim
gets stuck, these retries can hog the worker for quite some time.
Although the workqueues subsystem automatically adjusts concurrency, such
adjustment is not guaranteed to happen until the worker context sleeps.
And the fill_page_cache_func() function's retry loop is not guaranteed
to sleep (see the should_reclaim_retry() function).
And we have seen this function cause workqueue lockups:
kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s!
[...]
kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146
kernel: pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5
kernel: in-flight: 1917:fill_page_cache_func
kernel: pending: dbs_work_handler, free_work, kfree_rcu_monitor
Originally, we thought that the root cause of this lockup was several
retries with direct reclaim, but this is not yet confirmed. Furthermore,
we have seen similar lockups without any heavy memory pressure. This
suggests that there are other factors contributing to these lockups.
However, it is not really clear that endless retries are desireable.
So let's make the fill_page_cache_func() function back off after
allocation failure.
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_boost_kthread_setaffinity() function removes the outgoing CPU
from the set_cpus_allowed() mask for the corresponding leaf rcu_node
structure's rcub priority-boosting kthread. Except that if the outgoing
CPU will leave that structure without any online CPUs, the mask is set
to the housekeeping CPU mask from housekeeping_cpumask(). Which is fine
unless the outgoing CPU happens to be a housekeeping CPU.
This commit therefore removes the outgoing CPU from the housekeeping mask.
This would of course be problematic if the outgoing CPU was the last
online housekeeping CPU, but in that case you are in a world of hurt
anyway. If someone comes up with a valid use case for a system needing
all the housekeeping CPUs to be offline, further adjustments can be made.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Kernels built with PREEMPT_RCU=y and RCU_STRICT_GRACE_PERIOD=y trigger
irq-work from rcu_read_unlock(), and the resulting irq-work handler
invokes rcu_preempt_deferred_qs_handle(). The point of this triggering
is to force grace periods to end quickly in order to give tools like KASAN
a better chance of detecting RCU usage bugs such as leaking RCU-protected
pointers out of an RCU read-side critical section.
However, this irq-work triggering is unconditional. This works, but
there is no point in doing this irq-work unless the current grace period
is waiting on the running CPU or task, which is not the common case.
After all, in the common case there are many rcu_read_unlock() calls
per CPU per grace period.
This commit therefore triggers the irq-work only when the current grace
period is waiting on the running CPU or task.
This change was tested as follows on a four-CPU system:
echo rcu_preempt_deferred_qs_handler > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/function_profile_enabled
insmod rcutorture.ko
sleep 20
rmmod rcutorture.ko
echo 0 > /sys/kernel/debug/tracing/function_profile_enabled
echo > /sys/kernel/debug/tracing/set_ftrace_filter
This procedure produces results in this per-CPU set of files:
/sys/kernel/debug/tracing/trace_stat/function*
Sample output from one of these files is as follows:
Function Hit Time Avg s^2
-------- --- ---- --- ---
rcu_preempt_deferred_qs_handle 838746 182650.3 us 0.217 us 0.004 us
The baseline sum of the "Hit" values (the number of calls to this
function) was 3,319,015. With this commit, that sum was 1,140,359,
for a 2.9x reduction. The worst-case variance across the CPUs was less
than 25%, so this large effect size is statistically significant.
The raw data is available in the Link: URL.
Link: https://lore.kernel.org/all/20220808022626.12825-1-qiang1.zhang@intel.com/
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The dump_cpu_task() function does not print registers on architectures
that do not support NMIs. However, registers can be useful for
debugging. Fortunately, in the case where dump_cpu_task() is invoked
from an interrupt handler and is dumping the current CPU's stack, the
get_irq_regs() function can be used to get the registers.
Therefore, this commit makes dump_cpu_task() check to see if it is being
asked to dump the current CPU's stack from within an interrupt handler,
and, if so, it uses the get_irq_regs() function to obtain the registers.
On systems that do support NMIs, this commit has the further advantage
of avoiding a self-NMI in this case.
This is an example of rcu self-detected stall on arm64, which does not
support NMIs:
[ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619
[ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4)
[ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46
[ 27.504732] Hardware name: linux,dummy-virt (DT)
[ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 27.504998] pc : arch_counter_read+0x18/0x24
[ 27.505301] lr : arch_counter_read+0x18/0x24
[ 27.505328] sp : ffff80000b29bdf0
[ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000
[ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0
[ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff
[ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296
[ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426
[ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18
[ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013
[ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017
[ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653
[ 27.505937] Call trace:
[ 27.506002] arch_counter_read+0x18/0x24
[ 27.506171] ktime_get+0x48/0xa0
[ 27.506207] test_task+0x70/0xf0
[ 27.506227] kthread+0x10c/0x110
[ 27.506243] ret_from_fork+0x10/0x20
This is a marked improvement over the old output:
[ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614
[ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4)
[ 27.945731] Task dump for CPU 0:
[ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a
[ 27.946073] Call trace:
[ 27.946151] dump_backtrace.part.0+0xc8/0xd4
[ 27.946378] show_stack+0x18/0x70
[ 27.946405] sched_show_task+0x150/0x180
[ 27.946427] dump_cpu_task+0x44/0x54
[ 27.947193] rcu_dump_cpu_stacks+0xec/0x130
[ 27.947212] rcu_sched_clock_irq+0xb18/0xef0
[ 27.947231] update_process_times+0x68/0xac
[ 27.947248] tick_sched_handle+0x34/0x60
[ 27.947266] tick_sched_timer+0x4c/0xa4
[ 27.947281] __hrtimer_run_queues+0x178/0x360
[ 27.947295] hrtimer_interrupt+0xe8/0x244
[ 27.947309] arch_timer_handler_virt+0x38/0x4c
[ 27.947326] handle_percpu_devid_irq+0x88/0x230
[ 27.947342] generic_handle_domain_irq+0x2c/0x44
[ 27.947357] gic_handle_irq+0x44/0xc4
[ 27.947376] call_on_irq_stack+0x2c/0x54
[ 27.947415] do_interrupt_handler+0x80/0x94
[ 27.947431] el1_interrupt+0x34/0x70
[ 27.947447] el1h_64_irq_handler+0x18/0x24
[ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless
[ 27.947474] arch_counter_read+0x18/0x24
[ 27.947487] ktime_get+0x48/0xa0
[ 27.947501] test_task+0x70/0xf0
[ 27.947520] kthread+0x10c/0x110
[ 27.947538] ret_from_fork+0x10/0x20
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
The trigger_all_cpu_backtrace() function attempts to send an NMI to the
target CPU, which usually provides much better stack traces than the
dump_cpu_task() function's approach of dumping that stack from some other
CPU. So much so that most calls to dump_cpu_task() only happen after
a call to trigger_all_cpu_backtrace() has failed. And the exception to
this rule really should attempt to use trigger_all_cpu_backtrace() first.
Therefore, move the trigger_all_cpu_backtrace() invocation into
dump_cpu_task().
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Given that rcu_all_qs() is in non-preemptible kernels, why on earth should
it invoke preempt_disable()? This commit adds the reason, which is to
work nicely with debugging enabled in CONFIG_PREEMPT_COUNT=y kernels.
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, only Tree RCU leaks callbacks setting when it detects a
duplicate call_rcu(). This commit causes Tiny RCU to also leak
callbacks in this situation.
Because this is Tiny RCU, kernel size is important:
1. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
(Production kernel)
Original:
text data bss dec hex filename
26290663 20159823 15212544 61663030 3ace736 vmlinux
With this commit:
text data bss dec hex filename
26290663 20159823 15212544 61663030 3ace736 vmlinux
2. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
(Debugging kernel)
Original:
text data bss dec hex filename
26291319 20160143 15212544 61664006 3aceb06 vmlinux
With this commit:
text data bss dec hex filename
26291319 20160431 15212544 61664294 3acec26 vmlinux
These results show that the kernel size is unchanged for production
kernels, as desired.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Kernels built with CONFIG_PREEMPTION=n and CONFIG_PREEMPT_COUNT=y maintain
preempt_count() state. Because such kernels map __rcu_read_lock()
and __rcu_read_unlock() to preempt_disable() and preempt_enable(),
respectively, this allows the expedited grace period's !CONFIG_PREEMPT_RCU
version of the rcu_exp_handler() IPI handler function to use
preempt_count() to detect quiescent states.
This preempt_count() usage might seem to risk failures due to
use of implicit RCU readers in portions of the kernel under #ifndef
CONFIG_PREEMPTION, except that rcu_core() already disallows such implicit
RCU readers. The moral of this story is that you must use explicit
read-side markings such as rcu_read_lock() or preempt_disable() even if
the code knows that this kernel does not support preemption.
This commit therefore adds a preempt_count()-based check for a quiescent
state in the !CONFIG_PREEMPT_RCU version of the rcu_exp_handler()
function for kernels built with CONFIG_PREEMPT_COUNT=y, reporting an
immediate quiescent state when the interrupted code had both preemption
and softirqs enabled.
This change results in about a 2% reduction in expedited grace-period
latency in kernels built with both CONFIG_PREEMPT_RCU=n and
CONFIG_PREEMPT_COUNT=y.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20220622103549.2840087-1-qiang1.zhang@intel.com/
In non-premptible kernels, tasks never do context switches within
RCU read-side critical sections. Therefore, in such kernels, each
leaf rcu_node structure's ->blkd_tasks list will always be empty.
The comment on the non-preemptible version of rcu_preempt_deferred_qs()
confuses this point, so this commit therefore fixes it.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Kernels built with CONFIG_PREEMPT=n and CONFIG_RCU_STRICT_GRACE_PERIOD=y
report the quiescent state directly from the outermost rcu_read_unlock().
However, the current CPU's rcu_data structure's ->cpu_no_qs.b.norm
might still be set, in which case rcu_report_qs_rdp() will exit early,
thus failing to report quiescent state.
This commit therefore causes rcu_read_unlock_strict() to clear
CPU's rcu_data structure's ->cpu_no_qs.b.norm field before invoking
rcu_report_qs_rdp().
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
We can still see that a majority of the time is spent hashing task pointers:
...
16.98% [kernel] [k] rhashtable_jhash2
...
Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
CPU-independent. The reason for this is to update the per-CPU
'tsk_pinned' histogram.
To optimize the CPU-independent case to O(1), keep a separate
CPU-independent 'tsk_pinned_all' histogram.
The major source of complexity are transitions between "all
CPU-independent task breakpoints" and "mixed CPU-independent and
CPU-dependent task breakpoints". The code comments list all cases that
require handling.
After this optimization:
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.758 [sec]
|
| 34.336621 usecs/op
| 4395.087500 usecs/op/cpu
38.08% [kernel] [k] queued_spin_lock_slowpath
10.81% [kernel] [k] smp_cfm_core_cond
3.01% [kernel] [k] update_sg_lb_stats
2.58% [kernel] [k] osq_lock
2.57% [kernel] [k] llist_reverse_order
1.45% [kernel] [k] find_next_bit
1.21% [kernel] [k] flush_tlb_func_common
1.01% [kernel] [k] arch_install_hw_breakpoint
Showing that the time spent hashing keys has become insignificant.
With the given benchmark parameters, that's an improvement of 12%
compared with the old O(#cpus) version.
And finally, using the less aggressive parameters from the preceding
changes, we now observe:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.067 [sec]
|
| 35.292187 usecs/op
| 2258.700000 usecs/op/cpu
Which is an improvement of 12% compared to without the histogram
optimizations (baseline is 40 usecs/op). This is now on par with the
theoretical ideal (constraints disabled), and only 12% slower than no
breakpoints at all.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-15-elver@google.com
Running the perf benchmark with (note: more aggressive parameters vs.
preceding changes, but same 256 CPUs host):
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.989 [sec]
|
| 38.854160 usecs/op
| 4973.332500 usecs/op/cpu
20.43% [kernel] [k] queued_spin_lock_slowpath
18.75% [kernel] [k] osq_lock
16.98% [kernel] [k] rhashtable_jhash2
8.34% [kernel] [k] task_bp_pinned
4.23% [kernel] [k] smp_cfm_core_cond
3.65% [kernel] [k] bcmp
2.83% [kernel] [k] toggle_bp_slot
1.87% [kernel] [k] find_next_bit
1.49% [kernel] [k] __reserve_bp_slot
We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().
Obtaining the max_bp_pinned_slots() for CPU-independent task targets
currently is O(#cpus), and calls task_bp_pinned() for each CPU, even if
the result of task_bp_pinned() is CPU-independent.
The loop in max_bp_pinned_slots() wants to compute the maximum slots
across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
obtaining the max slots across all CPUs and adding task_bp_pinned().
To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
After this optimization:
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.930 [sec]
|
| 37.697832 usecs/op
| 4825.322500 usecs/op/cpu
19.13% [kernel] [k] queued_spin_lock_slowpath
18.21% [kernel] [k] rhashtable_jhash2
15.46% [kernel] [k] osq_lock
6.27% [kernel] [k] toggle_bp_slot
5.91% [kernel] [k] task_bp_pinned
5.05% [kernel] [k] smp_cfm_core_cond
1.78% [kernel] [k] update_sg_lb_stats
1.36% [kernel] [k] llist_reverse_order
1.34% [kernel] [k] find_next_bit
1.19% [kernel] [k] bcmp
Suggesting that time spent in task_bp_pinned() has been reduced.
However, we're still hashing too much, which will be addressed in the
subsequent change.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com
Factor out the existing `atomic_t count[N]` into its own struct called
'bp_slots_histogram', to generalize and make its intent clearer in
preparation of reusing elsewhere. The basic idea of bucketing "total
uses of N slots" resembles a histogram, so calling it such seems most
intuitive.
No functional change.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-13-elver@google.com
While optimizing task_bp_pinned()'s runtime complexity to O(1) on
average helps reduce time spent in the critical section, we still suffer
due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
that now contention is the biggest issue:
95.93% [kernel] [k] osq_lock
0.70% [kernel] [k] mutex_spin_on_owner
0.22% [kernel] [k] smp_cfm_core_cond
0.18% [kernel] [k] task_bp_pinned
0.18% [kernel] [k] rhashtable_jhash2
0.15% [kernel] [k] queued_spin_lock_slowpath
when running the breakpoint benchmark with (system with 256 CPUs):
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.207 [sec]
|
| 108.267188 usecs/op
| 6929.100000 usecs/op/cpu
The main concern for synchronizing the breakpoint constraints data is
that a consistent snapshot of the per-CPU and per-task data is observed.
The access pattern is as follows:
1. If the target is a task: the task's pinned breakpoints are counted,
checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
is used to check for conflicts with CPU-only breakpoints;
bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
unused.
2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
incremented. No per-task breakpoints are checked.
Since rhltable safely synchronizes insertions/deletions, we can allow
concurrency as follows:
1. If the target is a task: independent tasks may update and check the
constraints concurrently, but same-task target calls need to be
serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
checked, these modifications can happen concurrently by switching
tsk_pinned to atomic_t.
2. If the target is a CPU: access to the per-CPU constraints needs to
be serialized with other CPU-target and task-target callers (to
stabilize the bp_cpuinfo::tsk_pinned snapshot).
We can allow the above concurrency by introducing a per-CPU constraints
data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses
task_struct::perf_event_mutex):
1. If the target is a task: acquires perf_event_mutex, and acquires
bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes
contention in the presence of many read-lock but few write-lock
acquisitions: we assume many orders of magnitude more task target
breakpoints creations/destructions than CPU target breakpoints.
2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.
With these changes, contention with thousands of tasks is reduced to the
point where waiting on locking no longer dominates the profile:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.077 [sec]
|
| 40.201563 usecs/op
| 2572.900000 usecs/op/cpu
21.54% [kernel] [k] task_bp_pinned
20.18% [kernel] [k] rhashtable_jhash2
6.81% [kernel] [k] toggle_bp_slot
5.47% [kernel] [k] queued_spin_lock_slowpath
3.75% [kernel] [k] smp_cfm_core_cond
3.48% [kernel] [k] bcmp
On this particular setup that's a speedup of 2.7x.
We're also getting closer to the theoretical ideal performance through
optimizations in hw_breakpoint.c -- constraints accounting disabled:
| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.067 [sec]
|
| 35.286458 usecs/op
| 2258.333333 usecs/op/cpu
Which means the current implementation is ~12% slower than the
theoretical ideal.
For reference, performance without any breakpoints:
| $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 0 breakpoints and 64 parallelism
| Total time: 0.060 [sec]
|
| 31.365625 usecs/op
| 2007.400000 usecs/op/cpu
On a system with 256 CPUs, the theoretical ideal is only ~12% slower
than no breakpoints at all; the current implementation is ~28% slower.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
Flexible breakpoints have never been implemented, with
bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4
bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max
flexible count in fetch_bp_busy_slots().
This again causes suboptimal code generation, when we always know that
`!!slots.flexible` will be 0.
Just get rid of the flexible "placeholder" and remove all real code
related to it. Make a note in the comment related to the constraints
algorithm but don't remove them from the algorithm, so that if in future
flexible breakpoints need supporting, it should be trivial to revive
them (along with reverting this change).
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com
Due to being a __weak function, hw_breakpoint_weight() will cause the
compiler to always emit a call to it. This generates unnecessarily bad
code (register spills etc.) for no good reason; in fact it appears in
profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
...
0.70% [kernel] [k] hw_breakpoint_weight
...
While a small percentage, no architecture defines its own
hw_breakpoint_weight() nor are there users outside hw_breakpoint.c,
which makes the fact it is currently __weak a poor choice.
Change hw_breakpoint_weight()'s definition to follow a similar protocol
to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
hw_breakpoint_weight(), we'll use it instead.
The result is that it is inlined and no longer shows up in profiles.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com
Optimize internal hw_breakpoint state if the architecture's number of
breakpoint slots is constant. This avoids several kmalloc() calls and
potentially unnecessary failures if the allocations fail, as well as
subtly improves code generation and cache locality.
The protocol is that if an architecture defines hw_breakpoint_slots via
the preprocessor, it must be constant and the same for all types.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com
Mark read-only data after initialization as __ro_after_init.
While we are here, turn 'constraints_initialized' into a bool.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com
On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 236.418 [sec]
|
| 123134.794271 usecs/op
| 7880626.833333 usecs/op/cpu
The benchmark tests inherited breakpoint perf events across many
threads.
Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, which execute within the
'nr_bp_mutex' critical sections which then results in contention on that
mutex as well:
37.27% [kernel] [k] osq_lock
34.92% [kernel] [k] mutex_spin_on_owner
12.15% [kernel] [k] toggle_bp_slot
11.90% [kernel] [k] __reserve_bp_slot
The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.
Instead, make use of the "rhashtable" variant "rhltable" which stores
multiple items with the same key in a list. This results in average
runtime complexity of O(1) for task_bp_pinned().
With the optimization, the benchmark shows:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.208 [sec]
|
| 108.422396 usecs/op
| 6939.033333 usecs/op/cpu
On this particular setup that's a speedup of ~1135x.
While one option would be to make task_struct a breakpoint list node,
this would only further bloat task_struct for infrequently used data.
Furthermore, after all optimizations in this series, there's no evidence
it would result in better performance: later optimizations make the time
spent looking up entries in the hash table negligible (we'll reach the
theoretical ideal performance i.e. no constraints).
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
Clean up headers:
- Remove unused <linux/kallsyms.h>
- Remove unused <linux/kprobes.h>
- Remove unused <linux/module.h>
- Remove unused <linux/smp.h>
- Add <linux/export.h> for EXPORT_SYMBOL_GPL().
- Add <linux/mutex.h> for mutex.
- Sort alphabetically.
- Move <linux/hw_breakpoint.h> to top to test it compiles on its own.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-4-elver@google.com
Provide hw_breakpoint_is_used() to check if breakpoints are in use on
the system.
Use it in the KUnit test to verify the global state before and after a
test case.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-3-elver@google.com
Add KUnit test for hw_breakpoint constraints accounting, with various
interesting mixes of breakpoint targets (some care was taken to catch
interesting corner cases via bug-injection).
The test cannot be built as a module because it requires access to
hw_breakpoint_slots(), which is not inlinable or exported on all
architectures.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-2-elver@google.com