Zac's syzbot crafted a bpf prog that exposed two bugs in may_goto.
The 1st bug is the way may_goto is patched. When offset is negative
it should be patched differently.
The 2nd bug is in the verifier:
when current state may_goto_depth is equal to visited state may_goto_depth
it means there is an actual infinite loop. It's not correct to prune
exploration of the program at this point.
Note, that this check doesn't limit the program to only one may_goto insn,
since 2nd and any further may_goto will increment may_goto_depth only
in the queued state pushed for future exploration. The current state
will have may_goto_depth == 0 regardless of number of may_goto insns
and the verifier has to explore the program until bpf_exit.
Fixes: 011832b97b ("bpf: Introduce may_goto instruction")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/bpf/CAADnVQL-15aNp04-cyHRn47Yv61NXfYyhopyZtUyxNojUZUXpA@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20240619235355.85031-1-alexei.starovoitov@gmail.com
After the rework of "Parallel CPU bringup", the cmdline "nosmp" and
"maxcpus=0" parameters are not working anymore. These parameters set
setup_max_cpus to zero and that's handed to bringup_nonboot_cpus().
The code there does a decrement before checking for zero, which brings it
into the negative space and brings up all CPUs.
Add a zero check at the beginning of the function to prevent this.
[ tglx: Massaged change log ]
Fixes: 18415f33e2 ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE")
Fixes: 06c6796e03 ("cpu/hotplug: Fix off by one in cpuhp_bringup_mask()")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240618081336.3996825-1-chenhuacai@loongson.cn
Modify the comment formatting in irq_find_matching_fwspec function to
enhance code readability and maintain consistency.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Shivamurthy Shastri <shivamurthy.shastri@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240614102403.13610-2-shivamurthy.shastri@linutronix.de
A dying worker is first moved from pool->workers to pool->dying_workers
in set_worker_dying() and removed from pool->dying_workers in
detach_dying_workers(). The whole procedure is in the some lock context
of wq_pool_attach_mutex.
So pool->dying_workers is useless, just remove it and keep the dying
worker in pool->workers after set_worker_dying() and remove it in
detach_dying_workers() with wq_pool_attach_mutex held.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The code to kick off the destruction of workers is now in a process
context (idle_cull_fn()), and the detaching of a worker is not required
to be inside the worker thread now, so just do the detaching directly
in idle_cull_fn().
wake_dying_workers() is renamed to detach_dying_workers() and the unneeded
wakeup in wake_dying_workers() is also removed.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
So that when the rescuer is woken up next time, it will not interrupt
the last working cpu which might be busy on other crucial works but
have nothing to do with the rescuer's incoming works.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The code to kick off the destruction of workers is now in a process
context (idle_cull_fn()), so kthread_stop() can be used in the process
context to replace the work of pool->detach_completion.
The wakeup in wake_dying_workers() is unneeded after this change, but it
is harmless, jut keep it here until next patch renames wake_dying_workers()
rather than renaming it again and again.
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Share relocation implementation with the kernel. As part of this,
we also need the type/string iteration functions so also share
btf_iter.c file. Relocation code in kernel and userspace is identical
save for the impementation of the reparenting of split BTF to the
relocated base BTF and retrieval of the BTF header from "struct btf";
these small functions need separate user-space and kernel implementations
for the separate "struct btf"s they operate upon.
One other wrinkle on the kernel side is we have to map .BTF.ids in
modules as they were generated with the type ids used at BTF encoding
time. btf_relocate() optionally returns an array mapping from old BTF
ids to relocated ids, so we use that to fix up these references where
needed for kfuncs.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240620091733.1967885-5-alan.maguire@oracle.com
...as this will allow split BTF modules with a base BTF
representation (rather than the full vmlinux BTF at time of
BTF encoding) to resolve their references to kernel types in a
way that is more resilient to small changes in kernel types.
This will allow modules that are not built every time the kernel
is to provide more resilient BTF, rather than have it invalidated
every time BTF ids for core kernel types change.
Fields are ordered to avoid holes in struct module.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240620091733.1967885-3-alan.maguire@oracle.com
The BPF ring buffer internally is implemented as a power-of-2 sized circular
buffer, with two logical and ever-increasing counters: consumer_pos is the
consumer counter to show which logical position the consumer consumed the
data, and producer_pos which is the producer counter denoting the amount of
data reserved by all producers.
Each time a record is reserved, the producer that "owns" the record will
successfully advance producer counter. In user space each time a record is
read, the consumer of the data advanced the consumer counter once it finished
processing. Both counters are stored in separate pages so that from user
space, the producer counter is read-only and the consumer counter is read-write.
One aspect that simplifies and thus speeds up the implementation of both
producers and consumers is how the data area is mapped twice contiguously
back-to-back in the virtual memory, allowing to not take any special measures
for samples that have to wrap around at the end of the circular buffer data
area, because the next page after the last data page would be first data page
again, and thus the sample will still appear completely contiguous in virtual
memory.
Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
book-keeping the length and offset, and is inaccessible to the BPF program.
Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
for the BPF program to use. Bing-Jhong and Muhammad reported that it is however
possible to make a second allocated memory chunk overlapping with the first
chunk and as a result, the BPF program is now able to edit first chunk's
header.
For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
[0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets
allocate a chunk B with size 0x3000. This will succeed because consumer_pos
was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
page and could cause a crash.
Fix it by calculating the oldest pending_pos and check whether the range
from the oldest outstanding record to the newest would span beyond the ring
buffer size. If that is the case, then reject the request. We've tested with
the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
before/after the fix and while it seems a bit slower on some benchmarks, it
is still not significantly enough to matter.
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
When the following program is processed by the verifier:
L1: may_goto L2
goto L1
L2: w0 = 0
exit
the may_goto insn is first converted to:
L1: r11 = *(u64 *)(r10 -8)
if r11 == 0x0 goto L2
r11 -= 1
*(u64 *)(r10 -8) = r11
goto L1
L2: w0 = 0
exit
then later as the last step the verifier inserts:
*(u64 *)(r10 -8) = BPF_MAX_LOOPS
as the first insn of the program to initialize loop count.
When the first insn happens to be a branch target of some jmp the
bpf_patch_insn_data() logic will produce:
L1: *(u64 *)(r10 -8) = BPF_MAX_LOOPS
r11 = *(u64 *)(r10 -8)
if r11 == 0x0 goto L2
r11 -= 1
*(u64 *)(r10 -8) = r11
goto L1
L2: w0 = 0
exit
because instruction patching adjusts all jmps and calls, but for this
particular corner case it's incorrect and the L1 label should be one
instruction down, like:
*(u64 *)(r10 -8) = BPF_MAX_LOOPS
L1: r11 = *(u64 *)(r10 -8)
if r11 == 0x0 goto L2
r11 -= 1
*(u64 *)(r10 -8) = r11
goto L1
L2: w0 = 0
exit
and that's what this patch is fixing.
After bpf_patch_insn_data() call adjust_jmp_off() to adjust all jmps
that point to newly insert BPF_ST insn to point to insn after.
Note that bpf_patch_insn_data() cannot easily be changed to accommodate
this logic, since jumps that point before or after a sequence of patched
instructions have to be adjusted with the full length of the patch.
Conceptually it's somewhat similar to "insert" of instructions between other
instructions with weird semantics. Like "insert" before 1st insn would require
adjustment of CALL insns to point to newly inserted 1st insn, but not an
adjustment JMP insns that point to 1st, yet still adjusting JMP insns that
cross over 1st insn (point to insn before or insn after), hence use simple
adjust_jmp_off() logic to fix this corner case. Ideally bpf_patch_insn_data()
would have an auxiliary info to say where 'the start of newly inserted patch
is', but it would be too complex for backport.
Fixes: 011832b97b ("bpf: Introduce may_goto instruction")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Closes: https://lore.kernel.org/bpf/CAADnVQJ_WWx8w4b=6Gc2EpzAjgv+6A0ridnMz2TvS2egj4r3Gw@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20240619011859.79334-1-alexei.starovoitov@gmail.com
The new generic LSM hook security_file_post_open() was recently added
to the LSM framework in commit 8f46ff5767 ("security: Introduce
file_post_open hook"). Let's proactively add this generic LSM hook to
the sleepable_lsm_hooks BTF ID set, because I can't see there being
any strong reasons not to, and it's only a matter of time before
someone else comes around and asks for it to be there.
security_file_post_open() is inherently sleepable as it's purposely
situated in the kernel that allows LSMs to directly read out the
contents of the backing file if need be. Additionally, it's called
directly after security_file_open(), and that LSM hook in itself
already exists in the sleepable_lsm_hooks BTF ID set.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240618192923.379852-1-mattbobrowski@google.com
This reverts [1] and changes return value for bpf_session_cookie
in bpf selftests. Having long * might lead to problems on 32-bit
architectures.
Fixes: 2b8dd87332 ("bpf: Make bpf_session_cookie() kfunc return long *")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240619081624.1620152-1-jolsa@kernel.org
The function returns the idle calls counter for the current cpu and
therefore usually isn't what the caller wants. It is unnused since
commit 466a2b42d6 ("cpufreq: schedutil: Use idle_calls counter of the
remote CPU")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240617161615.49309-1-christian.loehle@arm.com
Commit cf8e865810 ("arch: Remove Itanium (IA-64) architecture")
removed the only definition of macro _TIF_MCA_INIT, so kdb_curr_task()
is actually the same as curr_task() now and becomes redundant.
Let's remove the definition of kdb_curr_task() and replace remaining
calls with curr_task().
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240620142132.157518-1-zhengzengkai@huawei.com
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
The function kdb_position_cursor() takes in a "prompt" parameter but
never uses it. This doesn't _really_ matter since all current callers
of the function pass the same value and it's a global variable, but
it's a bit ugly. Let's clean it up.
Found by code inspection. This patch is expected to functionally be a
no-op.
Fixes: 09b3598942 ("kdb: Use format-strings rather than '\0' injection in kdb_read()")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528071144.1.I0feb49839c6b6f4f2c4bf34764f5e95de3f55a66@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
When -Wformat-security is not disabled, using a string pointer
as a format causes a warning:
kernel/debug/kdb/kdb_io.c: In function 'kdb_read':
kernel/debug/kdb/kdb_io.c:365:36: error: format not a string literal and no format arguments [-Werror=format-security]
365 | kdb_printf(kdb_prompt_str);
| ^~~~~~~~~~~~~~
kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
kernel/debug/kdb/kdb_io.c:456:20: error: format not a string literal and no format arguments [-Werror=format-security]
456 | kdb_printf(kdb_prompt_str);
| ^~~~~~~~~~~~~~
Use an explcit "%s" format instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528121154.3662553-1-arnd@kernel.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
This new_n is defined in the start of this function.
Its value is overwritten by `new_n = min(n, log->len_total);`
a couple lines before my change,
rendering the shadow declaration unnecessary.
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-4-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes a compiler warning. The __bpf_free_used_btfs function
was taking an extra unused struct bpf_prog_aux *aux param
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-3-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes a compiler warning. the bpf_jit_binary_pack_finalize function
was taking an extra bpf_prog parameter that went unused.
This removves it and updates the callers accordingly.
Signed-off-by: Rafael Passos <rafael@rcpassos.me>
Link: https://lore.kernel.org/r/20240615022641.210320-2-rafael@rcpassos.me
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It's confusing to inspect 'prog->aux->tail_call_reachable' with drgn[0],
when bpf prog has tail call but 'tail_call_reachable' is false.
This patch corrects 'tail_call_reachable' when bpf prog has tail call.
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
Link: https://lore.kernel.org/r/20240610124224.34673-2-hffilwlqm@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
1e7962114c ("bnxt_en: Restore PTP tx_avail count in case of skb_pad() error")
165f87691a ("bnxt_en: add timestamping statistics support")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- ipv6: bring NLM_DONE out to a separate recv() again
Current release - new code bugs:
- wifi: cfg80211: wext: set ssids=NULL for passive scans via old wext API
Previous releases - regressions:
- wifi: mac80211: fix monitor channel setting with chanctx emulation
(probably most awaited of the fixes in this PR, tracked by Thorsten)
- usb: ax88179_178a: bring back reset on init, if PHY is disconnected
- bpf: fix UML x86_64 compile failure with BPF
- bpf: avoid splat in pskb_pull_reason(), sanity check added can be hit
with malicious BPF
- eth: mvpp2: use slab_build_skb() for packets in slab, driver was
missed during API refactoring
- wifi: iwlwifi: add missing unlock of mvm mutex
Previous releases - always broken:
- ipv6: add a number of missing null-checks for in6_dev_get(), in case
IPv6 disabling races with the datapath
- bpf: fix reg_set_min_max corruption of fake_reg
- sched: act_ct: add netns as part of the key of tcf_ct_flow_table
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZ0VAAACgkQMUZtbf5S
IrtMnQ//b0YNnC2PduSn6fDnDamyZW3vjqwXQ6K0DsgSzEIiAtEd6LbkPN4vAcpp
k634dHseQjTuAcsTZxisIs32nC2up9q/t/+6XD8VSaQbSzKhB+rFDviUxfGJWjt4
MZRK0mDcmib2tXAEfYnMi+QjvC5S+ZSHLpemDdzTI3AyKcPynqLcM1PcC0CGS5GS
6MpvRAtEgTAkXd2rc4WAbOcmd8NLJN80f/srRDXFVqrXy8f6adaULvCvzSXSiQy8
peUaPhI6BYNBL2Tzjp3D+Nh54ks3Ol8MeqaGYsuJHtgd+/I+/YWzYc74an8BuEwR
C6fszbH7i64WaQUI5ZhX/1Da0CTesNxzsPgeAFP3qEe20r53vN0NiFjRrHpO02El
lew9Hrx27Zzt9k3eSdtC3GGj/S93PYjE5RRuSClQrW8fUqETZ8dFocbrNAraHGMv
rDOqIT3XMg/BIBw9ADxizAgsrFC0QbBShQPs2iMuuVwmrWj9DEC0GKlt3KxyPT36
fl4w3gGRdIDz/ZTXKQZtta3Z4ckaKiTw8jbNXxteBDEHErFYYND+4XDzK/uIqHCe
0IoVWVUnhVfKOuGBIDGIFDsAvbgqTcVd+wZTB4SxZsbXISzpfYLcrM4qXf4YQNNb
MeIQg0Zwjm+xdLGXVCt8wBBGmj4EK9uMa3wjYu3lGREgxyH42eI=
=Lb9b
-----END PGP SIGNATURE-----
Merge tag 'net-6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, bpf and netfilter.
Happy summer solstice! The line count is a bit inflated by a selftest
and update to a driver's FW interface header, in reality this is
slightly below average for us. We are expecting one driver fix from
Intel, but there are no big known issues.
Current release - regressions:
- ipv6: bring NLM_DONE out to a separate recv() again
Current release - new code bugs:
- wifi: cfg80211: wext: set ssids=NULL for passive scans via old wext API
Previous releases - regressions:
- wifi: mac80211: fix monitor channel setting with chanctx emulation
(probably most awaited of the fixes in this PR, tracked by Thorsten)
- usb: ax88179_178a: bring back reset on init, if PHY is disconnected
- bpf: fix UML x86_64 compile failure with BPF
- bpf: avoid splat in pskb_pull_reason(), sanity check added can be hit
with malicious BPF
- eth: mvpp2: use slab_build_skb() for packets in slab, driver was
missed during API refactoring
- wifi: iwlwifi: add missing unlock of mvm mutex
Previous releases - always broken:
- ipv6: add a number of missing null-checks for in6_dev_get(), in case
IPv6 disabling races with the datapath
- bpf: fix reg_set_min_max corruption of fake_reg
- sched: act_ct: add netns as part of the key of tcf_ct_flow_table"
* tag 'net-6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
net: usb: rtl8150 fix unintiatilzed variables in rtl8150_get_link_ksettings
selftests: virtio_net: add forgotten config options
bnxt_en: Restore PTP tx_avail count in case of skb_pad() error
bnxt_en: Set TSO max segs on devices with limits
bnxt_en: Update firmware interface to 1.10.3.44
net: stmmac: Assign configured channel value to EXTTS event
net: do not leave a dangling sk pointer, when socket creation fails
net/tcp_ao: Don't leak ao_info on error-path
ice: Fix VSI list rule with ICE_SW_LKUP_LAST type
ipv6: bring NLM_DONE out to a separate recv() again
selftests: add selftest for the SRv6 End.DX6 behavior with netfilter
selftests: add selftest for the SRv6 End.DX4 behavior with netfilter
netfilter: move the sysctl nf_hooks_lwtunnel into the netfilter core
seg6: fix parameter passing when calling NF_HOOK() in End.DX4 and End.DX6 behaviors
netfilter: ipset: Fix suspicious rcu_dereference_protected()
selftests: openvswitch: Set value to nla flags.
octeontx2-pf: Fix linking objects into multiple modules
octeontx2-pf: Add error handling to VLAN unoffload handling
virtio_net: fixing XDP for fully checksummed packets handling
virtio_net: checksum offloading handling fix
...
Current try_to_grab_pending() activates the inactive item and
subsequently treats it as though it were a standard activated item.
This approach prevents duplicating handling logic for both active and
inactive items, yet the premature activation of an inactive item
triggers trace_workqueue_activate_work(), yielding an unintended user
space visible side effect.
And the unnecessary increment of the nr_active, which is not a simple
counter now, followed by a counteracted decrement, is inefficient and
complicates the code.
Just remove the nr_active manipulation code in grabbing inactive items.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The "cpuset.cpus.exclusive.effective" value is currently limited to a
subset of its "cpuset.cpus". This makes the exclusive CPUs distribution
hierarchy subsumed within the larger "cpuset.cpus" hierarchy. We have to
decide on what CPUs are used locally and what CPUs can be passed down as
exclusive CPUs down the hierarchy and combine them into "cpuset.cpus".
The advantage of the current scheme is to have only one hierarchy to
worry about. However, it make it harder to use as all the "cpuset.cpus"
values have to be properly set along the way down to the designated remote
partition root. It also makes it more cumbersome to find out what CPUs
can be used locally.
Make creation of remote partition simpler by breaking the
dependency of "cpuset.cpus.exclusive" on "cpuset.cpus" and make
them independent entities. Now we have two separate hierarchies -
one for setting "cpuset.cpus.effective" and the other one for setting
"cpuset.cpus.exclusive.effective". We may not need to set "cpuset.cpus"
when we activate a partition root anymore.
Also update Documentation/admin-guide/cgroup-v2.rst and cpuset.c comment
to document this change.
Suggested-by: Petr Malat <oss@malat.biz>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The CS_CPU_EXCLUSIVE flag is currently set whenever cpuset.cpus.exclusive
is set to make sure that the exclusivity test will be run to ensure its
exclusiveness. At the same time, this flag can be changed whenever the
partition root state is changed. For example, the CS_CPU_EXCLUSIVE flag
will be reset whenever a partition root becomes invalid. This makes
using CS_CPU_EXCLUSIVE to ensure exclusiveness a bit fragile.
The current scheme also makes setting up a cpuset.cpus.exclusive
hierarchy to enable remote partition harder as cpuset.cpus.exclusive
cannot overlap with any cpuset.cpus of sibling cpusets if their
cpuset.cpus.exclusive aren't set.
Solve these issues by deferring the setting of CS_CPU_EXCLUSIVE flag
until the cpuset become a valid partition root while adding new checks
in validate_change() to ensure that cpuset.cpus.exclusive of sibling
cpusets cannot overlap.
An additional check is also added to validate_change() to make sure that
cpuset.cpus of one cpuset cannot be a subset of cpuset.cpus.exclusive
of a sibling cpuset to avoid the problem that none of those CPUs will
be available when these exclusive CPUs are extracted out to a newly
enabled partition root. The Documentation/admin-guide/cgroup-v2.rst
file is updated to document the new constraints.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since commit 181c8e091a ("cgroup/cpuset: Introduce remote partition"),
a remote partition can be created underneath a non-partition root cpuset
as long as its exclusive_cpus are set to distribute exclusive CPUs down
to its children. The generate_sched_domains() function, however, doesn't
take into account this new behavior and hence will fail to create the
sched domain needed for a remote root (non-isolated) partition.
There are two issues related to remote partition support. First of
all, generate_sched_domains() has a fast path that is activated if
root_load_balance is true and top_cpuset.nr_subparts is non-zero. The
later condition isn't quite correct for remote partitions as nr_subparts
just shows the number of local child partitions underneath it. There
can be no local child partition under top_cpuset even if there are
remote partitions further down the hierarchy. Fix that by checking
for subpartitions_cpus which contains exclusive CPUs allocated to both
local and remote partitions.
Secondly, the valid partition check for subtree skipping in the csa[]
generation loop isn't enough as remote partition does not need to
have a partition root parent. Fix this problem by breaking csa[] array
generation loop of generate_sched_domains() into v1 and v2 specific parts
and checking a cpuset's exclusive_cpus before skipping its subtree in
the v2 case.
Also simplify generate_sched_domains() for cgroup v2 as only
non-isolating partition roots should be included in building the cpuset
array and none of the v1 scheduling attributes other than a different
way to create an isolated partition are supported.
Fixes: 181c8e091a ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Restrict gen-API tests for synthetic and kprobe events to only be built as
modules, as they generate dynamic events that cannot be removed, causing
ftracetest and startup selftests to fail.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmZy6HobHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bqtYIAMLap5hV/w9Gh5b32hOF
/FS/oqGTIs8wfvZq2PBOruFmmvhrqjvpbZVTU9aNUr2lywYALM+jgO3ElSLIoZdz
5s8Wsnic5a2DvG23r/S5u80f85Gxy14e5fvCcCT/3Bvw1ip65XdMXqUwh9oM4zHh
i8rmeIIJmVspHD9bxTREsosB8/LKvSx6GNzLrHwHyL5UepDgj/r5hLvyEuY3fyuo
hazbvsZbHi+aduAS3it+BnhMoFLgLzqrYi8dl1fPY+xmnGI2LZZkds1mfD1JmjBB
AVm9gOWKpW+HHoxeMEMcAs8mhithR7VGA2V2zdsOmRzndytKhUghHWvgcrBZWvl6
D5Y=
=BNpD
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fix from Masami Hiramatsu:
- Restrict gen-API tests for synthetic and kprobe events to only be
built as modules, as they generate dynamic events that cannot be
removed, causing ftracetest and startup selftests to fail
* tag 'probes-fixes-v6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Build event generation tests only as modules
cgroup_exit() needs to do this only if the exiting task is a leader and it
is not the last live thread. The patch doesn't use delay_group_leader(),
atomic_read(signal->live) matches the code css_task_iter_advance() more.
cgroup_release() can now check list_empty(task->cg_list) before it takes
css_set_lock and calls ss_set_skip_task_iters().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This commit adds the get_completed_synchronize_srcu() and the
same_state_synchronize_srcu() functions. The first returns a cookie
that is always interpreted as corresponding to an expired grace period.
The second does an equality comparison of a pair of cookies.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have polled SRCU grace periods, a grace period can be
started by start_poll_synchronize_srcu() as well as call_srcu(),
synchronize_srcu(), and synchronize_srcu_expedited(). This commit
therefore calls out this new start_poll_synchronize_srcu() possibility
in the comment on the WARN_ON().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Interrupts are enabled in srcu_gp_end(), so this commit switches from
spin_lock_irqsave_rcu_node() and spin_unlock_irqrestore_rcu_node()
to spin_lock_irq_rcu_node() and spin_unlock_irq_rcu_node().
Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Interrupts are enabled in rcu_gp_init(), so this commit switches from
local_irq_save() and local_irq_restore() to local_irq_disable() and
local_irq_enable().
Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In the synchronize_rcu() common case, we will have less than
SR_MAX_USERS_WAKE_FROM_GP number of users per GP. Waking up the kworker
is pointless just to free the last injected wait head since at that point,
all the users have already been awakened.
Introduce a new counter to track this and prevent the wakeup in the
common case.
[ paulmck: Remove atomic_dec_return_release in cannot-happen state. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmZwh44UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNRBA//Q09J/SADHi63fjpStx+Gvo5h6TbM
L4gqYsjxpi7CfXFwlBtFRjk9Q0osRDxbDWTuZ8gMcJONlRdHpZFil2gYSEacImsn
tAkrQpV32U1oNua+kgoIkQTHwNIKjA9odYZ4pyJ0AZvnB5Z62B841r8GAaTADg++
fGOuCBYZeuioCAjPUN2KZtkCKdhiu823Gwe2z9U6SJyCdPqRFjpBuumDoNvCTrCB
UJuc5DqWSNk2rZXZQG6RSLeOOZZwRf9s2ATU96T/9Lp0m6qqxPPisHkWscjhx5Ve
W7z2IWGFrNzJ8ABKwBK/NUMQbs3WzsepyPqZdoo//PkhMjQlfb+5iPitJWM6qmdM
6jgj2HkDzX2OtR9u6VOcOKKwz4NQnf4JcHRUDjq8vQ3eKYOTcDLx4VR8O/Ullmhf
pZL4klNXpBrw7DLYurTlpbm9jUmMCev9DvuSYJmyRjq7jA+8Cph6+clGriIbljqn
9hCqSnbufDxySwB0unYu9zwnC4bN+Yzcgr4qYFoA+zdj5eYloaJvPhwOh6MPsQaO
DJlCt6Wfw4SqD3afxaJnzw4/SBRuPA8ISoxTXVJUg7Q+NfUI8HBDO4YihiqJ7cm0
yvD0mFvweJVEpX2slDyob58xYgkmL8TaIPErJ9A/EO30W0nm+nQzXDR+cOa9VqAc
txcTscOv5YMLLMk=
=nYky
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20240617' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm fix from Paul Moore:
"A single LSM/IMA patch to fix a problem caused by sleeping while in a
RCU critical section"
* tag 'lsm-pr-20240617' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
ima: Avoid blocking in RCU read-side critical section
Mainly MM singleton fixes. And a couple of ocfs2 regression fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZnCEQAAKCRDdBJ7gKXxA
jmgSAQDk3BYs1n67cnwx/Zi04yMYDyfYTCYg2udPfT2a+GpmbwD+N5dJd/vCztXH
5eLpP11xd/yr2+I9FefyZeUuA80KtgQ=
=2agY
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Mainly MM singleton fixes. And a couple of ocfs2 regression fixes"
* tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
kcov: don't lose track of remote references during softirqs
mm: shmem: fix getting incorrect lruvec when replacing a shmem folio
mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick
mm: fix possible OOB in numa_rebuild_large_mapping()
mm/migrate: fix kernel BUG at mm/compaction.c:2761!
selftests: mm: make map_fixed_noreplace test names stable
mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC
mm: mmap: allow for the maximum number of bits for randomizing mmap_base by default
gcov: add support for GCC 14
zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
mm: huge_memory: fix misused mapping_large_folio_support() for anon folios
lib/alloc_tag: fix RCU imbalance in pgalloc_tag_get()
lib/alloc_tag: do not register sysctl interface when CONFIG_SYSCTL=n
MAINTAINERS: remove Lorenzo as vmalloc reviewer
Revert "mm: init_mlocked_on_free_v3"
mm/page_table_check: fix crash on ZONE_DEVICE
gcc: disable '-Warray-bounds' for gcc-9
ocfs2: fix NULL pointer dereference in ocfs2_abort_trigger()
ocfs2: fix NULL pointer dereference in ocfs2_journal_dirty()
In coerce_subreg_to_size_sx(), for the case where upper
sign extension bits are the same for smax32 and smin32
values, we missed to setup properly. This is especially
problematic if both smax32 and smin32's sign extension
bits are 1.
The following is a simple example illustrating the inconsistent
verifier states due to missed var_off:
0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar()
1: (bf) r3 = r0 ; R0_w=scalar(id=1) R3_w=scalar(id=1)
2: (57) r3 &= 15 ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))
3: (47) r3 |= 128 ; R3_w=scalar(smin=umin=smin32=umin32=128,smax=umax=smax32=umax32=143,var_off=(0x80; 0xf))
4: (bc) w7 = (s8)w3
REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0xffffff80, 0x8f] s64=[0xffffff80, 0x8f]
u32=[0xffffff80, 0x8f] s32=[0x80, 0xffffff8f] var_off=(0x80, 0xf)
The var_off=(0x80, 0xf) is not correct, and the correct one should
be var_off=(0xffffff80; 0xf) since from insn 3, we know that at
insn 4, the sign extension bits will be 1. This patch fixed this
issue by setting var_off properly.
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240615174632.3995278-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Zac reported a verification failure and Alexei reproduced the issue
with a simple reproducer ([1]). The verification failure is due to missed
setting for var_off.
The following is the reproducer in [1]:
0: R1=ctx() R10=fp0
0: (71) r3 = *(u8 *)(r10 -387) ;
R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R10=fp0
1: (bc) w7 = (s8)w3 ;
R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
2: (36) if w7 >= 0x2533823b goto pc-3
mark_precise: frame0: last_idx 2 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r7 stack= before 1: (bc) w7 = (s8)w3
mark_precise: frame0: regs=r3 stack= before 0: (71) r3 = *(u8 *)(r10 -387)
2: R7_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=127,var_off=(0x0; 0x7f))
3: (b4) w0 = 0 ; R0_w=0
4: (95) exit
Note that after insn 1, the var_off for R7 is (0x0; 0x7f). This is not correct
since upper 24 bits of w7 could be 0 or 1. So correct var_off should be
(0x0; 0xffffffff). Missing var_off setting in set_sext32_default_val() caused later
incorrect analysis in zext_32_to_64(dst_reg) and reg_bounds_sync(dst_reg).
To fix the issue, set var_off correctly in set_sext32_default_val(). The correct
reg state after insn 1 becomes:
1: (bc) w7 = (s8)w3 ;
R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
R7_w=scalar(smin=0,smax=umax=0xffffffff,smin32=-128,smax32=127,var_off=(0x0; 0xffffffff))
and at insn 2, the verifier correctly determines either branch is possible.
[1] https://lore.kernel.org/bpf/CAADnVQLPU0Shz7dWV4bn2BgtGdxN3uFHPeobGBA72tpg5Xoykw@mail.gmail.com/
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240615174626.3994813-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
ACPI MADT doesn't allow to offline a CPU after it has been woken up.
Currently, CPU hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible user of
the wake up method. Any platform that uses ACPI MADT wakeup method cannot
offline CPU.
Disable CPU offlining on ACPI MADT wakeup enumeration.
This has no visible effects for users: currently, TDX guest is the only platform
that uses the ACPI MADT wakeup method.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-5-kirill.shutemov@linux.intel.com
The ACPI MADT mailbox wakeup method doesn't allow to offline a CPU after
it has been woken up.
Currently, offlining is prevented based on the confidential computing attribute
which is set for Intel TDX. But TDX is not the only possible user of the wake up
method. The MADT wakeup can be implemented outside of a confidential computing
environment. Offline support is a property of the wakeup method, not the CoCo
implementation.
Introduce cpu_hotplug_disable_offlining() that can be called to indicate that
CPU offlining should be disabled.
This function is going to replace CC_ATTR_HOTPLUG_DISABLED for ACPI MADT wakeup
method.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tao Liu <ltao@redhat.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-4-kirill.shutemov@linux.intel.com
Domain creation functions use __irq_domain_add(). With the introduction
of irq_domain_instantiate(), __irq_domain_add() becomes obsolete.
In order to fully remove __irq_domain_add(), convert domain
creation function to irq_domain_instantiate()
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240614173232.1184015-19-herve.codina@bootlin.com
The current API functions create an irq_domain and also publish this
newly created to domain. Once an irq_domain is published, consumers can
request IRQ in order to use them.
Some interrupt controller drivers have to perform some more operations
with the created irq_domain in order to have it ready to be used.
For instance:
- Allocate generic irq chips with irq_alloc_domain_generic_chips()
- Retrieve the generic irq chips with irq_get_domain_generic_chip()
- Initialize retrieved chips: set register base address and offsets,
set several hooks such as irq_mask, irq_unmask, ...
With the newly introduced irq_domain_alloc_generic_chips(), an interrupt
controller driver can use the irq_domain_chip_generic_info structure and
set the init() hook to perform its generic chips initialization.
In order to avoid a window where the domain is published but not yet
ready to be used, handle the generic chip creation (i.e the
irq_domain_alloc_generic_chips() call) before the domain is published.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240614173232.1184015-16-herve.codina@bootlin.com
Most of generic chip drivers need to perform some more additional
initializations on the generic chips allocated before they can be fully
ready.
These additional initializations need to be performed before the IRQ
domain is published to avoid a race condition between IRQ consumers and
suppliers.
Introduce the init() hook to perform these initializations at the right
place just after the generic chip creation. Also introduce the exit() hook
to allow reverting operations done by the init() hook just before the
generic chip is destroyed.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240614173232.1184015-15-herve.codina@bootlin.com
The existing __irq_alloc_domain_generic_chips() uses a bunch of parameters
to describe the generic chips that need to be allocated.
Adding more parameters and wrappers to hide new parameters in the existing
code leads to more and more code without any relevant values and without
any flexibility.
Introduce irq_domain_alloc_generic_chips() where the generic chips
description is done using the irq_domain_chip_generic_info structure
instead of the bunch of parameters to allow flexibility and easy evolution.
Also introduce irq_domain_remove_generic_chips() to revert the operations
done by irq_domain_alloc_generic_chips().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240614173232.1184015-14-herve.codina@bootlin.com
The current API does not allow additional initialization before the
domain is published. This can lead to a race condition between consumers
and supplier as a domain can be available for consumers before being
fully ready.
Introduce the init() hook to allow additional initialization before
plublishing the domain. Also introduce the exit() hook to revert
operations done in init() on domain removal.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240614173232.1184015-13-herve.codina@bootlin.com