Where an architecture selects HAVE_STATIC_CALL but not
HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
which will either branch to a callee or return to the caller.
On such architectures, a number of constraints can conspire to make
those trampolines more complicated and potentially less useful than we'd
like. For example:
* Hardware and software control flow integrity schemes can require the
addition of "landing pad" instructions (e.g. `BTI` for arm64), which
will also be present at the "real" callee.
* Limited branch ranges can require that trampolines generate or load an
address into a register and perform an indirect branch (or at least
have a slow path that does so). This loses some of the benefits of
having a direct branch.
* Interaction with SW CFI schemes can be complicated and fragile, e.g.
requiring that we can recognise idiomatic codegen and remove
indirections understand, at least until clang proves more helpful
mechanisms for dealing with this.
For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
really only need to enable/disable specific preemption functions. We can
achieve the same effect without a number of the pain points above by
using static keys to fold early returns into the preemption functions
themselves rather than in an out-of-line trampoline, effectively
inlining the trampoline into the start of the function.
For arm64, this results in good code generation. For example, the
dynamic_cond_resched() wrapper looks as follows when enabled. When
disabled, the first `B` is replaced with a `NOP`, resulting in an early
return.
| <dynamic_cond_resched>:
| bti c
| b <dynamic_cond_resched+0x10> // or `nop`
| mov w0, #0x0
| ret
| mrs x0, sp_el0
| ldr x0, [x0, #8]
| cbnz x0, <dynamic_cond_resched+0x8>
| paciasp
| stp x29, x30, [sp, #-16]!
| mov x29, sp
| bl <preempt_schedule_common>
| mov w0, #0x1
| ldp x29, x30, [sp], #16
| autiasp
| ret
... compared to the regular form of the function:
| <__cond_resched>:
| bti c
| mrs x0, sp_el0
| ldr x1, [x0, #8]
| cbz x1, <__cond_resched+0x18>
| mov w0, #0x0
| ret
| paciasp
| stp x29, x30, [sp, #-16]!
| mov x29, sp
| bl <preempt_schedule_common>
| mov w0, #0x1
| ldp x29, x30, [sp], #16
| autiasp
| ret
Any architecture which implements static keys should be able to use this
to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
calls. Since this is likely to have greater overhead than (inlined)
static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
HAVE_PREEMPT_DYNAMIC_CALL is selected.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
Now that the enabled/disabled states for the preemption functions are
declared alongside their definitions, the core PREEMPT_DYNAMIC logic is
no longer tied to GENERIC_ENTRY, and can safely be selected so long as
an architecture provides enabled/disabled states for
irqentry_exit_cond_resched().
Make it possible to select HAVE_PREEMPT_DYNAMIC without GENERIC_ENTRY.
For existing users of HAVE_PREEMPT_DYNAMIC there should be no functional
change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-5-mark.rutland@arm.com
Currently callers of irqentry_exit_cond_resched() need to be aware of
whether the function should be indirected via a static call, leading to
ugly ifdeffery in callers.
Save them the hassle with a static inline wrapper that does the right
thing. The raw_irqentry_exit_cond_resched() will also be useful in
subsequent patches which will add conditional wrappers for preemption
functions.
Note: in arch/x86/entry/common.c, xen_pv_evtchn_do_upcall() always calls
irqentry_exit_cond_resched() directly, even when PREEMPT_DYNAMIC is in
use. I believe this is a latent bug (which this patch corrects), but I'm
not entirely certain this wasn't deliberate.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-4-mark.rutland@arm.com
Currently sched_dynamic_update needs to open-code the enabled/disabled
function names for each preemption model it supports, when in practice
this is a boolean enabled/disabled state for each function.
Make this clearer and avoid repetition by defining the enabled/disabled
states at the function definition, and using helper macros to perform the
static_call_update(). Where x86 currently overrides the enabled
function, it is made to provide both the enabled and disabled states for
consistency, with defaults provided by the core code otherwise.
In subsequent patches this will allow us to support PREEMPT_DYNAMIC
without static calls.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-3-mark.rutland@arm.com
The PREEMPT_DYNAMIC logic in kernel/sched/core.c patches static calls
for a bunch of preemption functions. While most are defined prior to
this, the definition of cond_resched() is later in the file, and so we
only have its declarations from include/linux/sched.h.
In subsequent patches we'd like to define some macros alongside the
definition of each of the preemption functions, which we can use within
sched_dynamic_update(). For this to be possible, the PREEMPT_DYNAMIC
logic needs to be placed after the various preemption functions.
As a preparatory step, this patch moves the PREEMPT_DYNAMIC logic after
the various preemption functions, with no other changes -- this is
purely a move.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-2-mark.rutland@arm.com
Where commit 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.
Commit 13765de814 ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.
Fixes: 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
As stated in the comment found in maybe_wait_bpf_programs(),
the synchronize_rcu() barrier is only needed before returning
to userspace, not after each deletion in the batch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20220218181801.2971275-1-eric.dumazet@gmail.com
A few vendor callbacks are only used by VMX, but they return an integer
or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Daniel Borkmann says:
====================
bpf-next 2022-02-17
We've added 29 non-merge commits during the last 8 day(s) which contain
a total of 34 files changed, 1502 insertions(+), 524 deletions(-).
The main changes are:
1) Add BTFGen support to bpftool which allows to use CO-RE in kernels without
BTF info, from Mauricio Vásquez, Rafael David Tinoco, Lorenzo Fontana and
Leonardo Di Donato. (Details: https://lpc.events/event/11/contributions/948/)
2) Prepare light skeleton to be used in both kernel module and user space
and convert bpf_preload.ko to use light skeleton, from Alexei Starovoitov.
3) Rework bpftool's versioning scheme and align with libbpf's version number;
also add linked libbpf version info to "bpftool version", from Quentin Monnet.
4) Add minimal C++ specific additions to bpftool's skeleton codegen to
facilitate use of C skeletons in C++ applications, from Andrii Nakryiko.
5) Add BPF verifier sanity check whether relative offset on kfunc calls overflows
desc->imm and reject the BPF program if the case, from Hou Tao.
6) Fix libbpf to use a dynamically allocated buffer for netlink messages to
avoid receiving truncated messages on some archs, from Toke Høiland-Jørgensen.
7) Various follow-up fixes to the JIT bpf_prog_pack allocator, from Song Liu.
8) Various BPF selftest and vmtest.sh fixes, from Yucong Sun.
9) Fix bpftool pretty print handling on dumping map keys/values when no BTF
is available, from Jiri Olsa and Yinjun Zhang.
10) Extend XDP frags selftest to check for invalid length, from Lorenzo Bianconi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (29 commits)
bpf: bpf_prog_pack: Set proper size before freeing ro_header
selftests/bpf: Fix crash in core_reloc when bpftool btfgen fails
selftests/bpf: Fix vmtest.sh to launch smp vm.
libbpf: Fix memleak in libbpf_netlink_recv()
bpftool: Fix C++ additions to skeleton
bpftool: Fix pretty print dump for maps without BTF loaded
selftests/bpf: Test "bpftool gen min_core_btf"
bpftool: Gen min_core_btf explanation and examples
bpftool: Implement btfgen_get_btf()
bpftool: Implement "gen min_core_btf" logic
bpftool: Add gen min_core_btf command
libbpf: Expose bpf_core_{add,free}_cands() to bpftool
libbpf: Split bpf_core_apply_relo()
bpf: Reject kfunc calls that overflow insn->imm
selftests/bpf: Add Skeleton templated wrapper as an example
bpftool: Add C++-specific open/load/etc skeleton wrappers
selftests/bpf: Fix GCC11 compiler warnings in -O2 mode
bpftool: Fix the error when lookup in no-btf maps
libbpf: Use dynamically allocated buffer when receiving netlink messages
libbpf: Fix libbpf.map inheritance chain for LIBBPF_0.7.0
...
====================
Link: https://lore.kernel.org/r/20220217232027.29831-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf 2022-02-17
We've added 8 non-merge commits during the last 7 day(s) which contain
a total of 8 files changed, 119 insertions(+), 15 deletions(-).
The main changes are:
1) Add schedule points in map batch ops, from Eric.
2) Fix bpf_msg_push_data with len 0, from Felix.
3) Fix crash due to incorrect copy_map_value, from Kumar.
4) Fix crash due to out of bounds access into reg2btf_ids, from Kumar.
5) Fix a bpf_timer initialization issue with clang, from Yonghong.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add schedule points in batch ops
bpf: Fix crash due to out of bounds access into reg2btf_ids.
selftests: bpf: Check bpf_msg_push_data return value
bpf: Fix a bpf_timer initialization issue
bpf: Emit bpf_timer in vmlinux BTF
selftests/bpf: Add test for bpf_timer overwriting crash
bpf: Fix crash due to incorrect copy_map_value
bpf: Do not try bpf_msg_push_data with len 0
====================
Link: https://lore.kernel.org/r/20220217190000.37925-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported various soft lockups caused by bpf batch operations.
INFO: task kworker/1:1:27 blocked for more than 140 seconds.
INFO: task hung in rcu_barrier
Nothing prevents batch ops to process huge amount of data,
we need to add schedule points in them.
Note that maybe_wait_bpf_programs(map) calls from
generic_map_delete_batch() can be factorized by moving
the call after the loop.
This will be done later in -next tree once we get this fix merged,
unless there is strong opinion doing this optimization sooner.
Fixes: aa2e93b8e5 ("bpf: Add generic support for update and delete batch ops")
Fixes: cb4d03ab49 ("bpf: Add generic support for lookup batch op")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Brian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/bpf/20220217181902.808742-1-eric.dumazet@gmail.com
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.
Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.
Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).
No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.
Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
While examining is_ucounts_overlimit and reading the various messages
I realized that is_ucounts_overlimit fails to deal with counts that
may have wrapped.
Being wrapped should be a transitory state for counts and they should
never be wrapped for long, but it can happen so handle it.
Cc: stable@vger.kernel.org
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Link: https://lkml.kernel.org/r/20220216155832.680775-5-ebiederm@xmission.com
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
During set*id() which cred->ucounts to charge the the current process
to is not known until after set_cred_ucounts. So move the
RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call
flag_nproc_exceeded after set_cred_ucounts.
This is very much an arbitrary subset of the places where we currently
change the RLIMIT_NPROC accounting, designed to preserve the existing
logic.
Fixing the existing logic will be the subject of another series of
changes.
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220216155832.680775-4-ebiederm@xmission.com
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Michal Koutný <mkoutny@suse.com> wrote:
> Tasks are associated to multiple users at once. Historically and as per
> setrlimit(2) RLIMIT_NPROC is enforce based on real user ID.
>
> The commit 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
> made the accounting structure "indexed" by euid and hence potentially
> account tasks differently.
>
> The effective user ID may be different e.g. for setuid programs but
> those are exec'd into already existing task (i.e. below limit), so
> different accounting is moot.
>
> Some special setresuid(2) users may notice the difference, justifying
> this fix.
I looked at cred->ucount and it is only used for rlimit operations
that were previously stored in cred->user. Making the fact
cred->ucount can refer to a different user from cred->user a bug,
affecting all uses of cred->ulimit not just RLIMIT_NPROC.
Fix set_cred_ucounts to always use the real uid not the effective uid.
Further simplify set_cred_ucounts by noticing that set_cred_ucounts
somehow retained a draft version of the check to see if alloc_ucounts
was needed that checks the new->user and new->user_ns against the
current_real_cred(). Remove that draft version of the check.
All that matters for setting the cred->ucounts are the user_ns and uid
fields in the cred.
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-4-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220216155832.680775-3-ebiederm@xmission.com
Reported-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Michal Koutný <mkoutny@suse.com> wrote:
> It was reported that v5.14 behaves differently when enforcing
> RLIMIT_NPROC limit, namely, it allows one more task than previously.
> This is consequence of the commit 21d1c5e386 ("Reimplement
> RLIMIT_NPROC on top of ucounts") that missed the sharpness of
> equality in the forking path.
This can be fixed either by fixing the test or by moving the increment
to be before the test. Fix it my moving copy_creds which contains
the increment before is_ucounts_overlimit.
In the case of CLONE_NEWUSER the ucounts in the task_cred changes.
The function is_ucounts_overlimit needs to use the final version of
the ucounts for the new process. Which means moving the
is_ucounts_overlimit test after copy_creds is necessary.
Both the test in fork and the test in set_user were semantically
changed when the code moved to ucounts. The change of the test in
fork was bad because it was before the increment. The test in
set_user was wrong and the change to ucounts fixed it. So this
fix only restores the old behavior in one lcation not two.
Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220216155832.680775-2-ebiederm@xmission.com
Cc: stable@vger.kernel.org
Reported-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Solar Designer <solar@openwall.com> wrote:
> I'm not aware of anyone actually running into this issue and reporting
> it. The systems that I personally know use suexec along with rlimits
> still run older/distro kernels, so would not yet be affected.
>
> So my mention was based on my understanding of how suexec works, and
> code review. Specifically, Apache httpd has the setting RLimitNPROC,
> which makes it set RLIMIT_NPROC:
>
> https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc
>
> The above documentation for it includes:
>
> "This applies to processes forked from Apache httpd children servicing
> requests, not the Apache httpd children themselves. This includes CGI
> scripts and SSI exec commands, but not any processes forked from the
> Apache httpd parent, such as piped logs."
>
> In code, there are:
>
> ./modules/generators/mod_cgid.c: ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> ./modules/generators/mod_cgi.c: ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> ./modules/filters/mod_ext_filter.c: rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc);
>
> For example, in mod_cgi.c this is in run_cgi_child().
>
> I think this means an httpd child sets RLIMIT_NPROC shortly before it
> execs suexec, which is a SUID root program. suexec then switches to the
> target user and execs the CGI script.
>
> Before 2863643fb8, the setuid() in suexec would set the flag, and the
> target user's process count would be checked against RLIMIT_NPROC on
> execve(). After 2863643fb8, the setuid() in suexec wouldn't set the
> flag because setuid() is (naturally) called when the process is still
> running as root (thus, has those limits bypass capabilities), and
> accordingly execve() would not check the target user's process count
> against RLIMIT_NPROC.
In commit 2863643fb8 ("set_user: add capability check when
rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to
make it more consistent with fork. Unfortunately because of call site
differences those capable calls were checking the credentials of the
user before set*id() instead of after set*id().
This breaks enforcement of RLIMIT_NPROC for applications that set the
rlimit and then call set*id() while holding a full set of
capabilities. The capabilities are only changed in the new credential
in security_task_fix_setuid().
The code in apache suexec appears to follow this pattern.
Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits
(RLIMIT_NPROC)") where this check was added describes the targes of this
capability check as:
2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then
calls setuid, the setuid should fail if the user would then be running
more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't. This patch
adds an appropriate test. With this patch, and per-user process limit
imposed in cgiwrap really works.
So the original use case of this check also appears to match the broken
pattern.
Restore the enforcement of RLIMIT_NPROC by removing the bad capable
checks added in set_user. This unfortunately restores the
inconsistent state the code has been in for the last 11 years, but
dealing with the inconsistencies looks like a larger problem.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/
Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com
Link: https://lkml.kernel.org/r/20220216155832.680775-1-ebiederm@xmission.com
Fixes: 2863643fb8 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reviewed-by: Solar Designer <solar@openwall.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Sysfs support might be disabled so we need to guard the code that
instantiates "compression" attribute with an #ifdef.
Fixes: b1ae6dc41e ("module: add in-kernel support for decompressing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
When commit e6ac2450d6 ("bpf: Support bpf program calling kernel function") added
kfunc support, it defined reg2btf_ids as a cheap way to translate the verifier
reg type to the appropriate btf_vmlinux BTF ID, however
commit c25b2ae136 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
moved the __BPF_REG_TYPE_MAX from the last member of bpf_reg_type enum to after
the base register types, and defined other variants using type flag
composition. However, now, the direct usage of reg->type to index into
reg2btf_ids may no longer fall into __BPF_REG_TYPE_MAX range, and hence lead to
out of bounds access and kernel crash on dereference of bad pointer.
Fixes: c25b2ae136 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220216201943.624869-1-memxor@gmail.com
As 'swsusp_check' open 'hib_resume_bdev', if call 'create_basic_memory_bitmaps'
failed, we need to close 'hib_resume_bdev' in 'load_image_and_restore' function.
Signed-off-by: Ye Bin <yebin10@huawei.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
BTFGen needs to run the core relocation logic in order to understand
what are the types involved in a given relocation.
Currently bpf_core_apply_relo() calculates and **applies** a relocation
to an instruction. Having both operations in the same function makes it
difficult to only calculate the relocation without patching the
instruction. This commit splits that logic in two different phases: (1)
calculate the relocation and (2) patch the instruction.
For the first phase bpf_core_apply_relo() is renamed to
bpf_core_calc_relo_insn() who is now only on charge of calculating the
relocation, the second phase uses the already existing
bpf_core_patch_insn(). bpf_object__relocate_core() uses both of them and
the BTFGen will use only bpf_core_calc_relo_insn().
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-2-mauricio@kinvolk.io
When dumping lock_classes information via /proc/lockdep, we can't take
the lockdep lock as the lock hold time is indeterminate. Iterating
over all_lock_classes without holding lock can be dangerous as there
is a slight chance that it may branch off to other lists leading to
infinite loop or even access invalid memory if changes are made to
all_lock_classes list in parallel.
To avoid this problem, iteration of lock classes is now done directly
on the lock_classes array itself. The lock_classes_in_use bitmap is
checked to see if the lock class is being used. To avoid iterating
the full array all the times, a new max_lock_class_idx value is added
to track the maximum lock_class index that is currently being used.
We can theoretically take the lockdep lock for iterating all_lock_classes
when other lockdep files (lockdep_stats and lock_stat) are accessed as
the lock hold time will be shorter for them. For consistency, they are
also modified to iterate the lock_classes array directly.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220211035526.1329503-2-longman@redhat.com
To prepare for supporting each housekeeping feature toward cpuset, split
the global housekeeping cpumask per HK_TYPE_* entry.
This will later allow, for example, to runtime modify the cpulist passed
through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot
parameters.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-9-frederic@kernel.org
If "nohz_full=" or "isolcpus=nohz" are called with CONFIG_NO_HZ_FULL=n,
housekeeping_mask doesn't get freed despite it being unused if
housekeeping_setup() is called for the first time.
Check this scenario first to fix this, so that no useless allocation
is performed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-8-frederic@kernel.org
Centralize the mask freeing and return value for the error path. This
makes potential leaks more visible.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-7-frederic@kernel.org
There can be two subsequent calls to housekeeping_setup() due to
"nohz_full=" and "isolcpus=" that can mix up. The two passes each have
their own way to deal with an empty housekeeping set of CPUs.
Consolidate this part and remove the awful "tmp" based naming.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-6-frederic@kernel.org
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
To prepare for supporting each feature of the housekeeping cpumask
toward cpuset, prepare each of the HK_FLAG_* entries to move to their
own cpumask with enforcing to fetch them individually. The new
constraint is that multiple HK_FLAG_* entries can't be mixed together
anymore in a single call to housekeeping cpumask().
This will later allow, for example, to runtime modify the cpulist passed
through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot
parameters.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220207155910.527133-3-frederic@kernel.org
When a new threshold breaching stall happens after a psi event was
generated and within the window duration, the new event is not
generated because the events are rate-limited to one per window. If
after that no new stall is recorded then the event will not be
generated even after rate-limiting duration has passed. This is
happening because with no new stall, window_update will not be called
even though threshold was previously breached. To fix this, record
threshold breaching occurrence and generate the event once window
duration is passed.
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com
In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes. But if the number of the hint page faults on a PMEM node is
the max for a task, The current NUMA balancing policy may try to place
the task on the PMEM node instead of DRAM node. This is unreasonable,
because there's no CPU in PMEM NUMA nodes. To fix this, CPU-less
nodes are ignored when searching the migration target node for a task
in this patch.
To test the patch, we run a workload that accesses more memory in PMEM
node than memory in DRAM node. Without the patch, the PMEM node will
be chosen as preferred node in task_numa_placement(). While the DRAM
node will be chosen instead with the patch.
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, and sched_max_numa_distance, etc.)
identified by scheduler may be wrong for systems with CPU-less nodes.
For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,
[000h 0000 4] Signature : "SLIT" [System Locality Information Table]
[004h 0004 4] Table Length : 0000042C
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : 59
[00Ah 0010 6] Oem ID : "XXXX"
[010h 0016 8] Oem Table ID : "XXXXXXX"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : "INTL"
[020h 0032 4] Asl Compiler Revision : 20091013
[024h 0036 8] Localities : 0000000000000004
[02Ch 0044 4] Locality 0 : 0A 15 11 1C
[030h 0048 4] Locality 1 : 15 0A 1C 11
[034h 0052 4] Locality 2 : 11 1C 0A 1C
[038h 0056 4] Locality 3 : 1C 11 1C 0A
While the `numactl -H` output is as follows,
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node 0 1 2 3
0: 10 21 17 28
1: 21 10 28 17
2: 17 28 10 28
3: 28 17 28 10
In this system, there are only 2 sockets. In each memory controller,
both DRAM and PMEM DIMMs are installed. Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex. Because both the distance(0, 1) and distance (1, 3) are less
than the distance (0, 3), it appears that node 1 sits between node 0
and node 3. And the whole system appears to be a glueless mesh NUMA
topology type. But it's definitely not, there is even no CPU in node 3.
This isn't a practical problem now yet. Because the PMEM nodes (node
2 and node 3 in example system) are offlined by default during system
boot. So init_numa_topology_type() called during system boot will
ignore them and set sched_numa_topology_type to NUMA_DIRECT. And
init_numa_topology_type() is only called at runtime when a CPU of a
never-onlined-before node gets plugged in. And there's no CPU in the
PMEM nodes. But it appears better to fix this to make the code more
robust.
To test the potential problem. We have used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()). With that, the NUMA parameters
identified by scheduler is as follows,
sched_numa_topology_type: NUMA_GLUELESS_MESH
sched_domains_numa_levels: 4
sched_max_numa_distance: 28
To fix the issue, the CPU-less nodes are ignored when the NUMA topology
parameters are identified. Because a node may become CPU-less or not
at run time because of CPU hotplug, the NUMA topology parameters need
to be re-initialized at runtime for CPU hotplug too.
With the patch, the NUMA parameters identified for the example system
above is as follows,
sched_numa_topology_type: NUMA_DIRECT
sched_domains_numa_levels: 2
sched_max_numa_distance: 21
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
In some places, kernel/sched code calls cpumask_weight() to check if
any bit of a given cpumask is set. We can do it more efficiently with
cpumask_empty() because cpumask_empty() stops traversing the cpumask as
soon as it finds first set bit, while cpumask_weight() counts all bits
unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com
WRITE_KERN is supposed to overwrite some kernel text, namely
do_overwritten() function.
But at the time being it overwrites do_overwritten() function
descriptor, not function text.
Fix it by dereferencing the function descriptor to obtain
function text pointer. Export dereference_function_descriptor()
for when LKDTM is built as a module.
And make do_overwritten() noinline so that it is really
do_overwritten() which is called by lkdtm_WRITE_KERN().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/31e58eaffb5bc51c07d8d4891d1982100ade8cfc.1644928018.git.christophe.leroy@csgroup.eu
dereference_function_descriptor() and
dereference_kernel_function_descriptor() are identical on the
three architectures implementing them.
Make them common and put them out-of-line in kernel/extable.c
which is one of the users and has similar type of functions.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/449db09b2eba57f4ab05f80102a67d8675bc8bcd.1644928018.git.christophe.leroy@csgroup.eu
Now kfunc call uses s32 to represent the offset between the address of
kfunc and __bpf_call_base, but it doesn't check whether or not s32 will
be overflowed. The overflow is possible when kfunc is in module and the
offset between module and kernel is greater than 2GB. Take arm64 as an
example, before commit b2eed9b588 ("arm64/kernel: kaslr: reduce module
randomization range to 2 GB"), the offset between module symbol and
__bpf_call_base will in 4GB range due to KASLR and may overflow s32.
So add an extra checking to reject these invalid kfunc calls.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220215065732.3179408-1-houtao1@huawei.com
The per-cpu @printk_pending variable can be updated from
sleepable contexts, such as:
get_random_bytes()
warn_unseeded_randomness()
printk_deferred()
defer_console_output()
and can be updated from interrupt contexts, such as:
handle_irq_event_percpu()
__irq_wake_thread()
wake_up_process()
try_to_wake_up()
select_task_rq()
select_fallback_rq()
printk_deferred()
defer_console_output()
and can be updated from NMI contexts, such as:
vprintk()
if (in_nmi()) defer_console_output()
Therefore the atomic variant of the updating functions must be used.
Replace __this_cpu_xchg() with this_cpu_xchg().
Replace __this_cpu_or() with this_cpu_or().
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87iltld4ue.fsf@jogness.linutronix.de
In order to let a const irqchip be fed to the irqchip layer, adjust
the various prototypes. An extra cast in irq_set_chip()() is required
to avoid a warning.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220209162607.1118325-3-maz@kernel.org
In order to let a const irqchip be fed to the irqchip layer, adjust
the various prototypes. An extra cast in irq_domain_set_hwirq_and_chip()
is required to avoid a warning.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220209162607.1118325-2-maz@kernel.org
Add a new single bit field to the task structure to track whether this task
has initialized the IA32_PASID MSR to the mm's PASID.
Initialize the field to zero when creating a new task with fork/clone.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com
PASIDs are process-wide. It was attempted to use refcounted PASIDs to
free them when the last thread drops the refcount. This turned out to
be complex and error prone. Given the fact that the PASID space is 20
bits, which allows up to 1M processes to have a PASID associated
concurrently, PASID resource exhaustion is not a realistic concern.
Therefore, it was decided to simplify the approach and stick with lazy
on demand PASID allocation, but drop the eager free approach and make an
allocated PASID's lifetime bound to the lifetime of the process.
Get rid of the refcounting mechanisms and replace/rename the interfaces
to reflect this new approach.
[ bp: Massage commit message. ]
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-6-fenghua.yu@intel.com
As previously discussed(https://lkml.org/lkml/2022/1/20/51),
cpuset_attach() is affected with similar cpu hotplug race,
as follow scenario:
cpuset_attach() cpu hotplug
--------------------------- ----------------------
down_write(cpuset_rwsem)
guarantee_online_cpus() // (load cpus_attach)
sched_cpu_deactivate
set_cpu_active()
// will change cpu_active_mask
set_cpus_allowed_ptr(cpus_attach)
__set_cpus_allowed_ptr_locked()
// (if the intersection of cpus_attach and
cpu_active_mask is empty, will return -EINVAL)
up_write(cpuset_rwsem)
To avoid races such as described above, protect cpuset_attach() call
with cpu_hotplug_lock.
Fixes: be367d0992 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
Cc: stable@vger.kernel.org # v2.6.32+
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A new mm doesn't have a PASID yet when it's created. Initialize
the mm's PASID on fork() or for init_mm to INVALID_IOASID (-1).
INIT_PASID (0) is reserved for kernel legacy DMA PASID. It cannot be
allocated to a user process. Initializing the process's PASID to 0 may
cause confusion that's why the process uses the reserved kernel legacy
DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly
tells the process doesn't have a valid PASID yet.
Even though the only user of mm_pasid_init() is in fork.c, define it in
<linux/sched/mm.h> as the first of three mm/pasid life cycle functions
(init/set/drop) to keep these all together.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-5-fenghua.yu@intel.com
In some places, RCU code calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
first set bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This is a rarely used function, so uninlining its 3 instructions
is probably a win or a wash - but the main motivation is to
make <linux/rcuwait.h> independent of task_struct details.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN reports data races between the rcu_segcblist_clear_flags() and
rcu_segcblist_set_flags() functions, though misreporting the latter
as a call to rcu_segcblist_is_enabled() from call_rcu(). This commit
converts the updates of this field to WRITE_ONCE(), relying on the
resulting unmarked reads to continue to detect buggy concurrent writes
to this field.
Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.
Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>