The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, and sched_max_numa_distance, etc.)
identified by scheduler may be wrong for systems with CPU-less nodes.
For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,
[000h 0000 4] Signature : "SLIT" [System Locality Information Table]
[004h 0004 4] Table Length : 0000042C
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : 59
[00Ah 0010 6] Oem ID : "XXXX"
[010h 0016 8] Oem Table ID : "XXXXXXX"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : "INTL"
[020h 0032 4] Asl Compiler Revision : 20091013
[024h 0036 8] Localities : 0000000000000004
[02Ch 0044 4] Locality 0 : 0A 15 11 1C
[030h 0048 4] Locality 1 : 15 0A 1C 11
[034h 0052 4] Locality 2 : 11 1C 0A 1C
[038h 0056 4] Locality 3 : 1C 11 1C 0A
While the `numactl -H` output is as follows,
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node 0 1 2 3
0: 10 21 17 28
1: 21 10 28 17
2: 17 28 10 28
3: 28 17 28 10
In this system, there are only 2 sockets. In each memory controller,
both DRAM and PMEM DIMMs are installed. Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex. Because both the distance(0, 1) and distance (1, 3) are less
than the distance (0, 3), it appears that node 1 sits between node 0
and node 3. And the whole system appears to be a glueless mesh NUMA
topology type. But it's definitely not, there is even no CPU in node 3.
This isn't a practical problem now yet. Because the PMEM nodes (node
2 and node 3 in example system) are offlined by default during system
boot. So init_numa_topology_type() called during system boot will
ignore them and set sched_numa_topology_type to NUMA_DIRECT. And
init_numa_topology_type() is only called at runtime when a CPU of a
never-onlined-before node gets plugged in. And there's no CPU in the
PMEM nodes. But it appears better to fix this to make the code more
robust.
To test the potential problem. We have used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()). With that, the NUMA parameters
identified by scheduler is as follows,
sched_numa_topology_type: NUMA_GLUELESS_MESH
sched_domains_numa_levels: 4
sched_max_numa_distance: 28
To fix the issue, the CPU-less nodes are ignored when the NUMA topology
parameters are identified. Because a node may become CPU-less or not
at run time because of CPU hotplug, the NUMA topology parameters need
to be re-initialized at runtime for CPU hotplug too.
With the patch, the NUMA parameters identified for the example system
above is as follows,
sched_numa_topology_type: NUMA_DIRECT
sched_domains_numa_levels: 2
sched_max_numa_distance: 21
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
In some places, kernel/sched code calls cpumask_weight() to check if
any bit of a given cpumask is set. We can do it more efficiently with
cpumask_empty() because cpumask_empty() stops traversing the cpumask as
soon as it finds first set bit, while cpumask_weight() counts all bits
unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com
WRITE_KERN is supposed to overwrite some kernel text, namely
do_overwritten() function.
But at the time being it overwrites do_overwritten() function
descriptor, not function text.
Fix it by dereferencing the function descriptor to obtain
function text pointer. Export dereference_function_descriptor()
for when LKDTM is built as a module.
And make do_overwritten() noinline so that it is really
do_overwritten() which is called by lkdtm_WRITE_KERN().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/31e58eaffb5bc51c07d8d4891d1982100ade8cfc.1644928018.git.christophe.leroy@csgroup.eu
dereference_function_descriptor() and
dereference_kernel_function_descriptor() are identical on the
three architectures implementing them.
Make them common and put them out-of-line in kernel/extable.c
which is one of the users and has similar type of functions.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/449db09b2eba57f4ab05f80102a67d8675bc8bcd.1644928018.git.christophe.leroy@csgroup.eu
Now kfunc call uses s32 to represent the offset between the address of
kfunc and __bpf_call_base, but it doesn't check whether or not s32 will
be overflowed. The overflow is possible when kfunc is in module and the
offset between module and kernel is greater than 2GB. Take arm64 as an
example, before commit b2eed9b588 ("arm64/kernel: kaslr: reduce module
randomization range to 2 GB"), the offset between module symbol and
__bpf_call_base will in 4GB range due to KASLR and may overflow s32.
So add an extra checking to reject these invalid kfunc calls.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220215065732.3179408-1-houtao1@huawei.com
The per-cpu @printk_pending variable can be updated from
sleepable contexts, such as:
get_random_bytes()
warn_unseeded_randomness()
printk_deferred()
defer_console_output()
and can be updated from interrupt contexts, such as:
handle_irq_event_percpu()
__irq_wake_thread()
wake_up_process()
try_to_wake_up()
select_task_rq()
select_fallback_rq()
printk_deferred()
defer_console_output()
and can be updated from NMI contexts, such as:
vprintk()
if (in_nmi()) defer_console_output()
Therefore the atomic variant of the updating functions must be used.
Replace __this_cpu_xchg() with this_cpu_xchg().
Replace __this_cpu_or() with this_cpu_or().
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87iltld4ue.fsf@jogness.linutronix.de
In order to let a const irqchip be fed to the irqchip layer, adjust
the various prototypes. An extra cast in irq_set_chip()() is required
to avoid a warning.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220209162607.1118325-3-maz@kernel.org
In order to let a const irqchip be fed to the irqchip layer, adjust
the various prototypes. An extra cast in irq_domain_set_hwirq_and_chip()
is required to avoid a warning.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220209162607.1118325-2-maz@kernel.org
Add a new single bit field to the task structure to track whether this task
has initialized the IA32_PASID MSR to the mm's PASID.
Initialize the field to zero when creating a new task with fork/clone.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com
PASIDs are process-wide. It was attempted to use refcounted PASIDs to
free them when the last thread drops the refcount. This turned out to
be complex and error prone. Given the fact that the PASID space is 20
bits, which allows up to 1M processes to have a PASID associated
concurrently, PASID resource exhaustion is not a realistic concern.
Therefore, it was decided to simplify the approach and stick with lazy
on demand PASID allocation, but drop the eager free approach and make an
allocated PASID's lifetime bound to the lifetime of the process.
Get rid of the refcounting mechanisms and replace/rename the interfaces
to reflect this new approach.
[ bp: Massage commit message. ]
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-6-fenghua.yu@intel.com
As previously discussed(https://lkml.org/lkml/2022/1/20/51),
cpuset_attach() is affected with similar cpu hotplug race,
as follow scenario:
cpuset_attach() cpu hotplug
--------------------------- ----------------------
down_write(cpuset_rwsem)
guarantee_online_cpus() // (load cpus_attach)
sched_cpu_deactivate
set_cpu_active()
// will change cpu_active_mask
set_cpus_allowed_ptr(cpus_attach)
__set_cpus_allowed_ptr_locked()
// (if the intersection of cpus_attach and
cpu_active_mask is empty, will return -EINVAL)
up_write(cpuset_rwsem)
To avoid races such as described above, protect cpuset_attach() call
with cpu_hotplug_lock.
Fixes: be367d0992 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
Cc: stable@vger.kernel.org # v2.6.32+
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A new mm doesn't have a PASID yet when it's created. Initialize
the mm's PASID on fork() or for init_mm to INVALID_IOASID (-1).
INIT_PASID (0) is reserved for kernel legacy DMA PASID. It cannot be
allocated to a user process. Initializing the process's PASID to 0 may
cause confusion that's why the process uses the reserved kernel legacy
DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly
tells the process doesn't have a valid PASID yet.
Even though the only user of mm_pasid_init() is in fork.c, define it in
<linux/sched/mm.h> as the first of three mm/pasid life cycle functions
(init/set/drop) to keep these all together.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-5-fenghua.yu@intel.com
In some places, RCU code calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
first set bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This is a rarely used function, so uninlining its 3 instructions
is probably a win or a wash - but the main motivation is to
make <linux/rcuwait.h> independent of task_struct details.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN reports data races between the rcu_segcblist_clear_flags() and
rcu_segcblist_set_flags() functions, though misreporting the latter
as a call to rcu_segcblist_is_enabled() from call_rcu(). This commit
converts the updates of this field to WRITE_ONCE(), relying on the
resulting unmarked reads to continue to detect buggy concurrent writes
to this field.
Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.
Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Because __call_rcu() is invoked only by call_rcu(), this commit inlines
the former into the latter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
As we handle parallel CPU bringup, we will need to take care to avoid
spawning multiple boost threads, or race conditions when setting their
affinity. Spotted by Paul McKenney.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This currently depends on CONFIG_IOMMU_SUPPORT. But it is only
needed when CONFIG_IOMMU_SVA option is enabled.
Change the CONFIG guards around definition and initialization
of mm->pasid field.
Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20220207230254.3342514-3-fenghua.yu@intel.com
If another CPU is in panic, we are about to be halted. Try to gracefully
abandon the console_sem, leaving it free for the panic CPU to grab.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-5-stephen.s.brennan@oracle.com
During panic(), if another CPU is writing heavily the kernel log (e.g.
via /dev/kmsg), then the panic CPU may livelock writing out its messages
to the console. Note when too many messages are dropped during panic and
suppress further printk, except from the panic CPU. This could result in
some important messages being dropped. However, messages are already
being dropped, so this approach at least prevents a livelock.
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-4-stephen.s.brennan@oracle.com
A CPU executing with console lock spinning enabled might be halted
during a panic. Before the panicking CPU calls console_flush_on_panic(),
it may call console_trylock(), which attempts to optimistically spin,
deadlocking the panic CPU:
CPU 0 (panic CPU) CPU 1
----------------- ------
printk() {
vprintk_func() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_lock_spinning_enable();
... printing to console ...
panic() {
crash_smp_send_stop() {
NMI -------------------> HALT
}
atomic_notifier_call_chain() {
printk() {
...
console_trylock_spinnning() {
// optimistic spin infinitely
This hang during panic can be induced when a kdump kernel is loaded, and
crash_kexec_post_notifiers=1 is present on the kernel command line. The
following script which concurrently writes to /dev/kmsg, and triggers a
panic, can result in this hang:
#!/bin/bash
date
# 991 chars (based on log buffer size):
chars="$(printf 'a%.0s' {1..991})"
while :; do
echo $chars > /dev/kmsg
done &
echo c > /proc/sysrq-trigger &
date
exit
To avoid this deadlock, ensure that console_trylock_spinning() does not
allow spinning once a panic has begun.
Fixes: dbdda842fe ("printk: Add console owner and waiter logic to load balance console writes")
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-3-stephen.s.brennan@oracle.com
This will be used help avoid deadlocks during panics. Although it would
be better to include this in linux/panic.h, it would require that header
to include linux/atomic.h as well. On some architectures, this results
in a circular dependency as well. So instead add the helper directly to
printk.c.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-2-stephen.s.brennan@oracle.com
The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.
A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
interface with: dxfer_len == 524288, dxdfer_dir == SG_DXFER_FROM_DEV
and a corresponding dxferp. The peculiar thing about this is that TUR
is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
bounces the user-space buffer. As if the device was to transfer into
it. Since commit a45b599ad8 ("scsi: sg: allocate with __GFP_ZERO in
sg_build_indirect()") we make sure this first bounce buffer is
allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
device won't touch the buffer we prepare as if the we had a
DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
and the buffer allocated by SG is mapped by the function
virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
scatter-gather and not scsi generics). This mapping involves bouncing
via the swiotlb (we need swiotlb to do virtio in protected guest like
s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
(that is swiotlb) bounce buffer (which most likely contains some
previous IO data), to the first bounce buffer, which contains all
zeros. Then we copy back the content of the first bounce buffer to
the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
ain't all zeros and fails.
One can argue that this is an swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in a sense that
it does not affect the outcome (if all other participants are well
behaved).
Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- Force HANDLER_EXIT even for SIGNAL_UNKILLABLE.
- Make seccomp self-destruct after fatal filter results.
- Update seccomp samples for easier behavioral demonstration.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmIHIccWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkuFD/9wY3voM9ryF6WW6XRz0tJ8qG6X
YUPMp+mzyDOlF6OGcSXQL1P0gV8frfAJx23sBW0dTayWGiwEOSqi83gHVnxU6jpn
KXfS7JOkcbY3x1A7hL38O4nIh0wkbtuHslkp6iWRrBsE4KsFY1r2pR2hYUEOJsgn
AALsnnFdQ57JD9tbfxrTzwHMgB9Rx3ROoKoTmoG78GU5vG3RvXYyeGN0JXrX2lM3
9N6oSYp+j2X6DpKjciQ8dsyBUpPNTfUad5v+MhsYbffwKbk+xPulJnK21KpqrfPY
rBLvR8DAFBZAGgXuO91m8Pb5PoBOkh6bb0ImbGB1W2u11/5dFriiAEl2YNuZin1z
d+BuuvPZ2i4g+plr1Inl/xkKPLIlobSBDGO/XGsZ+CQnI4ZzJm0ilJaxTPCwJ7Ky
4YzyaI66AKYcGUpTg5hNScEoRuC5kXCnIyFS+DLaBT/b1/C17MW7xsGW44UcXpS2
ROlDKgZ+fXwKlIuz0QCwjMMv5EIe8sjRpoHGd4CctodbRN3bznRNs4IXkeYytOTU
3vjkaYoyYCoPP370qm0NkDByh0JY5tQwLKlBfb4f5mZP8Gsi7kHQW6mBk0a0xwkB
i3PtzbRcIcRAGs2nm8qBxsAaEtCePY+V080+bEuV8NwI7Dz5C+xMtKCWa6CxmJtu
D//y/M2p/TLc42oZAA==
=zpOl
-----END PGP SIGNATURE-----
Merge tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"This fixes a corner case of fatal SIGSYS being ignored since v5.15.
Along with the signal fix is a change to seccomp so that seeing
another syscall after a fatal filter result will cause seccomp to kill
the process harder.
Summary:
- Force HANDLER_EXIT even for SIGNAL_UNKILLABLE
- Make seccomp self-destruct after fatal filter results
- Update seccomp samples for easier behavioral demonstration"
* tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
samples/seccomp: Adjust sample to also provide kill option
seccomp: Invalidate seccomp mode to catch death failures
signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
Commit 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.
On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v6
MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%)
MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%)
MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%)
MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%)
STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.
Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v5
Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%)
Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%*
Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%)
Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%)
CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%)
It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v6
Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%)
Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%)
Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%)
Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%)
Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%)
Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%*
Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%)
Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%)
Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%)
Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%)
Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%)
Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%)
Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%)
Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%)
Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%)
Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%)
Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%)
Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%)
Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%)
Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%)
Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%)
Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%)
Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%)
Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%)
Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%)
Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%)
Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%)
Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%)
Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%)
Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%)
Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%)
Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%)
Stddev tput-120 186384.86 ( 0.00%) 45812.03 ( 75.42%)
Stddev tput-128 78761.09 ( 0.00%) 57418.48 ( 27.10%)
Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.
vanilla sched-numaimb-v6
Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%)
Amean ep.D 31.86 ( 0.00%) 26.17 * 17.86%*
Stddev ep.D 0.07 ( 0.00%) 0.05 ( 24.41%)
CoeffVar ep.D 0.22 ( 0.00%) 0.20 ( 7.97%)
Max ep.D 31.93 ( 0.00%) 26.21 ( 17.91%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
There are inconsistencies when determining if a NUMA imbalance is allowed
that should be corrected.
o allow_numa_imbalance changes types and is not always examining
the destination group so both the type should be corrected as
well as the naming.
o find_idlest_group uses the sched_domain's weight instead of the
group weight which is different to find_busiest_group
o find_busiest_group uses the source group instead of the destination
which is different to task_numa_find_cpu
o Both find_idlest_group and find_busiest_group should account
for the number of running tasks if a move was allowed to be
consistent with task_numa_find_cpu
Fixes: 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
A kernel exception was hit when trying to dump /proc/lockdep_chains after
lockdep report "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!":
Unable to handle kernel paging request at virtual address 00054005450e05c3
...
00054005450e05c3] address between user and kernel address ranges
...
pc : [0xffffffece769b3a8] string+0x50/0x10c
lr : [0xffffffece769ac88] vsnprintf+0x468/0x69c
...
Call trace:
string+0x50/0x10c
vsnprintf+0x468/0x69c
seq_printf+0x8c/0xd8
print_name+0x64/0xf4
lc_show+0xb8/0x128
seq_read_iter+0x3cc/0x5fc
proc_reg_read_iter+0xdc/0x1d4
The cause of the problem is the function lock_chain_get_class() will
shift lock_classes index by 1, but the index don't need to be shifted
anymore since commit 01bb6f0af9 ("locking/lockdep: Change the range
of class_idx in held_lock struct") already change the index to start
from 0.
The lock_classes[-1] located at chain_hlocks array. When printing
lock_classes[-1] after the chain_hlocks entries are modified, the
exception happened.
The output of lockdep_chains are incorrect due to this problem too.
Fixes: f611e8cf98 ("lockdep: Take read/write status in consideration when generate chainkey")
Signed-off-by: Cheng Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20220210105011.21712-1-cheng-jui.wang@mediatek.com
Currently the following code in check_and_init_map_value()
*(struct bpf_timer *)(dst + map->timer_off) =
(struct bpf_timer){};
can help generate bpf_timer definition in vmlinuxBTF.
But the code above may not zero the whole structure
due to anonymour members and that code will be replaced
by memset in the subsequent patch and
bpf_timer definition will disappear from vmlinuxBTF.
Let us emit the type explicitly so bpf program can continue
to use it from vmlinux.h.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211194948.3141529-1-yhs@fb.com
- Revert a recent change that attempted to avoid issues with
conflicting address ranges during PCI initialization, because it
turned out to introduce a regression (Hans de Goede).
- Revert a change that limited EC GPE wakeups from suspend-to-idle
to systems based on Intel hardware, because it turned out that
systems based on hardware from other vendors depended on that
functionality too (Mario Limonciello).
- Fix two issues related to the handling of wakeup interrupts and
wakeup events signaled through the EC GPE during suspend-to-idle
on x86 (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmIGkq4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxs3MQAKhmnM5mHOqFIM7VM9SoutEoOIXPWLCz
CNjjqfwA63GvxTp//Dks4KGCLF90mfFOv8Mt7vlzVHyQDvTv2w0YddmBxQAgXTU4
v1yfHcRssND1w2kgIg7wThWSxSKvP/4XgqwrhTmEKHb1qPFrpq9pt7+4RIco9362
pacxW1QPX2+5yNDs67JuTsjh2mksKC8CUgiA8BUa+57jrIXvQYWqqZ490PpkbXW0
Nh1naNwT1xDqte5U98PrzZZRt2qUMuKrG2ro0lE97rb067zSoIQo2XCODlo7T12O
7vFzypOTvNJZjd57U9SoKyEDCzDnHkNh2O2jiNKaqzMJHqgh0bkMrMHysPbdTwMu
VH0fw+VElRONyQ5knUcvTG3IRfEuTBl2iqDoO+hLb+cmkL48KL9yJXH7dTRMrNJ8
zqHSCqpON8rZgLfDzrxoVvMXv9Al5ra5wM41EljiUWUDFEB1HbtleysHIMFMxEKy
gBvO6zAyEoh5ie2pxYdSAyY+pq+d1JFeAgpbXpF3miQFE2gchC1BjNjUCguekfty
t0kwOefLOphgShpIm/04F2WBsYNM+QSIoJWRvE5LsEVTdD/j59MweCTJN2KNSIpj
vapl8ymAr6umMrPhxpjY6K4DD000+LYpV9WJ8Q1Wv2HWV9OkevL5KnEXoQ94czpb
ybb/Nd4SwHV+
=aGYp
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These revert two commits that turned out to be problematic and fix two
issues related to wakeup from suspend-to-idle on x86.
Specifics:
- Revert a recent change that attempted to avoid issues with
conflicting address ranges during PCI initialization, because it
turned out to introduce a regression (Hans de Goede).
- Revert a change that limited EC GPE wakeups from suspend-to-idle to
systems based on Intel hardware, because it turned out that systems
based on hardware from other vendors depended on that functionality
too (Mario Limonciello).
- Fix two issues related to the handling of wakeup interrupts and
wakeup events signaled through the EC GPE during suspend-to-idle on
x86 (Rafael Wysocki)"
* tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"
PM: s2idle: ACPI: Fix wakeup interrupts handling
ACPI: PM: s2idle: Cancel wakeup before dispatching EC GPE
ACPI: PM: Revert "Only mark EC GPE for wakeup on Intel systems"
- Fixes to the RTLA tooling.
- A fix to a tp_printk overriding tp_printk_stop_on_boot on command line.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYgWtxxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsaaAQD+4lcpIRKdkfGb09xMlh8Gr8OvRoVb
5XAhzHVpETjGUAEAyIJAG+7Epw/St8FCSupNAEWTzGghjhoJhFblTd17jAg=
=CCON
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fixes to the RTLA tooling
- A fix to a tp_printk overriding tp_printk_stop_on_boot on the
command line
* tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix tp_printk option related with tp_printk_stop_on_boot
MAINTAINERS: Add RTLA entry
rtla: Fix segmentation fault when failing to enable -t
rtla/trace: Error message fixup
rtla/utils: Fix session duration parsing
rtla: Follow kernel version
This patch adds __sched attributes to a few missing places
to show blocked function rather than locking function
in get_wchan.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220115231657.84828-1-minchan@kernel.org
Add validation to ensure data is at or greater than the min size for the
fields of the event. If a dynamic array is used and is a type of char,
ensure null termination of the array exists.
Link: https://lkml.kernel.org/r/20220118204326.2169-7-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pass iterator through to probes to allow copying data directly to the
probe buffers instead of taking multiple copies. Enables eBPF user and
raw iterator types out to programs for no-copy scenarios.
Link: https://lkml.kernel.org/r/20220118204326.2169-6-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Adds support to write out user_event data to perf_probe/perf files as
well as to any attached eBPF program.
Link: https://lkml.kernel.org/r/20220118204326.2169-5-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Ensures that when dynamic events requests a match with arguments that
they match what is in the user_event.
Link: https://lkml.kernel.org/r/20220118204326.2169-4-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Addes print_fmt format generation for basic types that are supported for
user processes. Only supports sizes that are the same on 32 and 64 bit.
Link: https://lkml.kernel.org/r/20220118204326.2169-3-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Minimal support for interacting with dynamic events, trace_event and
ftrace. Core outline of flow between user process, ioctl and trace_event
APIs.
User mode processes that wish to use trace events to get data into
ftrace, perf, eBPF, etc are limited to uprobes today. The user events
features enables an ABI for user mode processes to create and write to
trace events that are isolated from kernel level trace events. This
enables a faster path for tracing from user mode data as well as opens
managed code to participate in trace events, where stub locations are
dynamic.
User processes often want to trace only when it's useful. To enable this
a set of pages are mapped into the user process space that indicate the
current state of the user events that have been registered. User
processes can check if their event is hooked to a trace/probe, and if it
is, emit the event data out via the write() syscall.
Two new files are introduced into tracefs to accomplish this:
user_events_status - This file is mmap'd into participating user mode
processes to indicate event status.
user_events_data - This file is opened and register/delete ioctl's are
issued to create/open/delete trace events that can be used for tracing.
The typical scenario is on process start to mmap user_events_status. Processes
then register the events they plan to use via the REG ioctl. The ioctl reads
and updates the passed in user_reg struct. The status_index of the struct is
used to know the byte in the status page to check for that event. The
write_index of the struct is used to describe that event when writing out to
the fd that was used for the ioctl call. The data must always include this
index first when writing out data for an event. Data can be written either by
write() or by writev().
For example, in memory:
int index;
char data[];
Psuedo code example of typical usage:
struct user_reg reg;
int page_fd = open("user_events_status", O_RDWR);
char *page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
close(page_fd);
int data_fd = open("user_events_data", O_RDWR);
reg.size = sizeof(reg);
reg.name_args = (__u64)"test";
ioctl(data_fd, DIAG_IOCSREG, ®);
int status_id = reg.status_index;
int write_id = reg.write_index;
struct iovec io[2];
io[0].iov_base = &write_id;
io[0].iov_len = sizeof(write_id);
io[1].iov_base = payload;
io[1].iov_len = sizeof(payload);
if (page_data[status_id])
writev(data_fd, io, 2);
User events are also exposed via the dynamic_events tracefs file for
both create and delete. Current status is exposed via the user_events_status
tracefs file.
Simple example to register a user event via dynamic_events:
echo u:test >> dynamic_events
cat dynamic_events
u:test
If an event is hooked to a probe, the probe hooked shows up:
echo 1 > events/user_events/test/enable
cat user_events_status
1:test # Used by ftrace
Active: 1
Busy: 1
Max: 4096
If an event is not hooked to a probe, no probe status shows up:
echo 0 > events/user_events/test/enable
cat user_events_status
1:test
Active: 1
Busy: 0
Max: 4096
Users can describe the trace event format via the following format:
name[:FLAG1[,FLAG2...] [field1[;field2...]]
Each field has the following format:
type name
Example for char array with a size of 20 named msg:
echo 'u:detailed char[20] msg' >> dynamic_events
cat dynamic_events
u:detailed char[20] msg
Data offsets are based on the data written out via write() and will be
updated to reflect the correct offset in the trace_event fields. For dynamic
data it is recommended to use the new __rel_loc data type. This type will be
the same as __data_loc, but the offset is relative to this entry. This allows
user_events to not worry about what common fields are being inserted before
the data.
The above format is valid for both the ioctl and the dynamic_events file.
Link: https://lkml.kernel.org/r/20220118204326.2169-2-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use the sched_switch function to save both the wakee and the waker comms
in the saved cmdlines list when sched_wakeup is done.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, synthetic event command error strings are restricted to a
length of MAX_FILTER_STR_VAL (256), which is too short for some
commands already seen in the wild (with cmd strings longer than that
showing up truncated in err_log).
Remove the restriction so that no synthetic event command error string
is ever truncated.
Link: https://lkml.kernel.org/r/0376692396a81d0b795127c66ea92ca5bf60f481.1643399022.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, hist trigger command error strings are restricted to a
length of MAX_FILTER_STR_VAL (256), which is too short for some
commands already seen in the wild (with cmd strings longer than that
showing up truncated in err_log).
Remove the restriction so that no hist trigger command error string is
ever truncated.
Link: https://lkml.kernel.org/r/0f9d46407222eaf6632cd3b417bc50a11f401b71.1643399022.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, tracing_log_err.cmd strings are restricted to a length of
MAX_FILTER_STR_VAL (256), which is too short for some commands already
seen in the wild (with cmd strings longer than that showing up
truncated).
Remove the restriction so that no command string is ever truncated.
Link: https://lkml.kernel.org/r/ca965f23256b350ebd94b3dc1a319f28e8267f5f.1643319703.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If seccomp tries to kill a process, it should never see that process
again. To enforce this proactively, switch the mode to something
impossible. If encountered: WARN, reject all syscalls, and attempt to
kill the process again even harder.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Fixes: 8112c4f140 ("seccomp: remove 2-phase API")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
were not being delivered to ptraced pid namespace init processes. Make
sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
Reported-by: Robert Święcki <robert@swiecki.net>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/lkml/878rui8u4a.fsf@email.froward.int.ebiederm.org
bpf_prog_pack causes build error with powerpc ppc64_defconfig:
kernel/bpf/core.c:830:23: error: variably modified 'bitmap' at file scope
830 | unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
| ^~~~~~
This is because the marco expands as:
unsigned long bitmap[((((((1UL) << (16 + __pte_index_size)) / (1 << 6))) \
+ ((sizeof(long) * 8)) - 1) / ((sizeof(long) * 8)))];
where __pte_index_size is a global variable.
Fix it by turning bitmap into a 0-length array.
Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211024939.2962537-1-song@kernel.org