CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.
Switch kprobes conditional over to CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.516286187@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.
Switch the conditionals in the tracer over to CONFIG_PREEMPTION.
This is the first step to make the tracer work on RT. The other small
tweaks are submitted separately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.409766323@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.
Switch the conditionals in RCU to use CONFIG_PREEMPTION.
That's the first step towards RCU on RT. The further tweaks are work in
progress. This neither touches the selftest bits which need a closer look
by Paul.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.210156346@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.
Switch the preemption code, scheduler and init task over to use
CONFIG_PREEMPTION.
That's the first step towards RT in that area. The more complex changes are
coming separately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.117528401@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already have tested it before. The second one should be removed.
With this change, the performance should have little improvement.
Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com
Cc: stable@vger.kernel.org
Fixes: 9cd2992f2d ("fgraph: Have set_graph_notrace only affect function_graph tracer")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
All callers hand in 'current' and that's the only task pointer which
actually makes sense. Remove the task argument and set current in the
function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.791885290@linutronix.de
Since commit b191d6491b ("pidfd: fix a poll race when setting exit_state")
we unconditionally set exit_state to EXIT_ZOMBIE before calling into
do_notify_parent(). This was done to eliminate a race when querying
exit_state in do_notify_pidfd().
Back then we decided to do the absolute minimal thing to fix this and
not touch the rest of the exit_notify() function where exit_state is
set.
Since this fix has not caused any issues change the setting of
exit_state to EXIT_DEAD in the autoreap case to account for the fact hat
exit_state is set to EXIT_ZOMBIE unconditionally. This fix was planned
but also explicitly requested in [1] and makes the whole code more
consistent.
/* References */
[1]: https://lore.kernel.org/lkml/CAHk-=wigcxGFR2szue4wavJtH5cYTTeNES=toUBVGsmX0rzX+g@mail.gmail.com
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's replace the old GPL boilerplate code with a proper SPDX
identifier instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some arches (e.g., arm64, x86) have moved towards non-executable
module_alloc() allocations for security hardening reasons. That means
that the module loader will need to set the text section of a module to
executable, regardless of whether or not CONFIG_STRICT_MODULE_RWX is set.
When CONFIG_STRICT_MODULE_RWX=y, module section allocations are always
page-aligned to handle memory rwx permissions. On some arches with
CONFIG_STRICT_MODULE_RWX=n however, when setting the module text to
executable, the BUG_ON() in frob_text() gets triggered since module
section allocations are not page-aligned when CONFIG_STRICT_MODULE_RWX=n.
Since the set_memory_* API works with pages, and since we need to call
set_memory_x() regardless of whether CONFIG_STRICT_MODULE_RWX is set, we
might as well page-align all module section allocations for ease of
managing rwx permissions of module sections (text, rodata, etc).
Fixes: 2eef1399a8 ("modules: fix BUG when load module with rodata=n")
Reported-by: Martin Kaiser <lists@kaiser.cx>
Reported-by: Bartosz Golaszewski <brgl@bgdev.pl>
Tested-by: David Lechner <david@lechnology.com>
Tested-by: Martin Kaiser <martin@kaiser.cx>
Tested-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
A common pattern when using xdp_redirect_map() is to create a device map
where the lookup key is simply ifindex. Because device maps are arrays,
this leaves holes in the map, and the map has to be sized to fit the
largest ifindex, regardless of how many devices actually are actually
needed in the map.
This patch adds a second type of device map where the key is looked up
using a hashmap, instead of being used as an array index. This allows maps
to be densely packed, so they can be smaller.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The subsequent patch to add a new devmap sub-type can re-use much of the
initialisation and allocation code, so refactor it into separate functions.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Previously a condition got missed where the pidfd waiters are awakened
before the exit_state gets set. This can result in a missed notification
[1] and the polling thread waiting forever.
It is fixed now, however it would be nice to avoid this kind of issue
going unnoticed in the future. So just add a warning to catch it in the
future.
/* References */
[1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
Signed-off-by: Christian Brauner <christian@brauner.io>
According to the original dma_direct_alloc_pages() code:
{
unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
if (!dma_release_from_contiguous(dev, page, count))
__free_pages(page, get_order(size));
}
The count parameter for dma_release_from_contiguous() was page
aligned before the right-shifting operation, while the new API
dma_free_contiguous() forgets to have PAGE_ALIGN() at the size.
So this patch simply adds it to prevent any corner case.
Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The dma_alloc_contiguous() limits align at CONFIG_CMA_ALIGNMENT for
cma_alloc() however it does not restore it for the fallback routine.
This will result in a size mismatch between the allocation and free
when running into the fallback routines after cma_alloc() fails, if
the align is larger than CONFIG_CMA_ALIGNMENT.
This patch adds a cma_align to take care of cma_alloc() and prevent
the align from being overwritten.
Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Reported-by: Dafna Hirschfeld <dafna.hirschfeld@collabora.com>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pull scheduler fixes from Thomas Gleixner:
"Two fixes for the fair scheduling class:
- Prevent freeing memory which is accessible by concurrent readers
- Make the RCU annotations for numa groups consistent"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Use RCU accessors consistently for ->numa_group
sched/fair: Don't free p->numa_faults with concurrent readers
Pull perf fixes from Thomas Gleixner:
"A pile of perf related fixes:
Kernel:
- Fix SLOTS PEBS event constraints for Icelake CPUs
- Add the missing mask bit to allow counting hardware generated
prefetches on L3 for Icelake CPUs
- Make the test for hypervisor platforms more accurate (as far as
possible)
- Handle PMUs correctly which override event->cpu
- Yet another missing fallthrough annotation
Tools:
perf.data:
- Fix loading of compressed data split across adjacent records
- Fix buffer size setting for processing CPU topology perf.data
header.
perf stat:
- Fix segfault for event group in repeat mode
- Always separate "stalled cycles per insn" line, it was being
appended to the "instructions" line.
perf script:
- Fix --max-blocks man page description.
- Improve man page description of metrics.
- Fix off by one in brstackinsn IPC computation.
perf probe:
- Avoid calling freeing routine multiple times for same pointer.
perf build:
- Do not use -Wshadow on gcc < 4.8, avoiding too strict warnings
treated as errors, breaking the build"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Mark expected switch fall-throughs
perf/core: Fix creating kernel counters for PMUs that override event->cpu
perf/x86: Apply more accurate check on hypervisor platform
perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register
perf/x86/intel: Fix SLOTS PEBS event constraint
perf build: Do not use -Wshadow on gcc < 4.8
perf probe: Avoid calling freeing routine multiple times for same pointer
perf probe: Set pev->nargs to zero after freeing pev->args entries
perf session: Fix loading of compressed data split across adjacent records
perf stat: Always separate stalled cycles per insn
perf stat: Fix segfault for event group in repeat mode
perf tools: Fix proper buffer size for feature processing
perf script: Fix off by one in brstackinsn IPC computation
perf script: Improve man page description of metrics
perf script: Fix --max-blocks man page description
Pull locking fixes from Thomas Gleixner:
"A set of locking fixes:
- Address the fallout of the rwsem rework. Missing ACQUIREs and a
sanity check to prevent a use-after-free
- Add missing checks for unitialized mutexes when mutex debugging is
enabled.
- Remove the bogus code in the generic SMP variant of
arch_futex_atomic_op_inuser()
- Fixup the #ifdeffery in lockdep to prevent compile warnings"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/mutex: Test for initialized mutex
locking/lockdep: Clean up #ifdef checks
locking/lockdep: Hide unused 'class' variable
locking/rwsem: Add ACQUIRE comments
tty/ldsem, locking/rwsem: Add missing ACQUIRE to read_failed sleep loop
lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop
locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
locking/rwsem: Don't call owner_on_cpu() on read-owner
futex: Cleanup generic SMP variant of arch_futex_atomic_op_inuser()
With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0). This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.
In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress. All we have to
do is look at the next pqueue list.
This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.
A work queue is used instead of the original try_again to avoid
hogging the CPU.
Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken. You
cannot flush async crypto requests so it makes no sense to even
try. A subsequent patch will fix it by replacing it with a ref
counting scheme.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Alexei Starovoitov says:
====================
pull-request: bpf 2019-07-25
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) fix segfault in libbpf, from Andrii.
2) fix gso_segs access, from Eric.
3) tls/sockmap fixes, from Jakub and John.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The access() (and faccessat()) credentials change can cause an
unnecessary load on the RCU machinery because every access() call ends
up freeing the temporary access credential using RCU.
This isn't really noticeable on small machines, but if you have hundreds
of cores you can cause huge slowdowns due to RCU storms.
It's easy to avoid: the temporary access crededntials aren't actually
normally accessed using RCU at all, so we can avoid the whole issue by
just marking them as such.
* access-creds:
access: avoid the RCU grace period for the temporary subjective credentials
Compiling a kernel with both FAIR_GROUP_SCHED=n and RT_GROUP_SCHED=n
will generate a compiler warning:
kernel/sched/core.c: In function 'sched_init':
kernel/sched/core.c:5906:32: warning: variable 'ptr' set but not used
It is unnecessary to have both "alloc_size" and "ptr", so just combine
them.
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/20190720012319.884-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to
move RT tasks between cgroups to which CPU controller has been attached;
but it is oddly possible to first move tasks around and then make them
RT (setschedule to FIFO/RR).
E.g.:
# mkdir /sys/fs/cgroup/cpu,cpuacct/group1
# chrt -fp 10 $$
# echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
bash: echo: write error: Invalid argument
# chrt -op 0 $$
# echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
# chrt -fp 10 $$
# cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
2345
2598
# chrt -p 2345
pid 2345's current scheduling policy: SCHED_FIFO
pid 2345's current scheduling priority: 10
Also, as Michal noted, it is currently not possible to enable CPU
controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there
are any kernel RT threads in root cgroup, they can't be migrated to the
newly created CPU controller's root in cgroup_update_dfl_csses()).
Existing code comes with a comment saying the "we don't support RT-tasks
being in separate groups". Such comment is however stale and belongs to
pre-RT_GROUP_SCHED times. Also, it doesn't make much sense for
!RT_GROUP_ SCHED configurations, since checks related to RT bandwidth
are not performed at all in these cases.
Make moving RT tasks between CPU controller groups viable by removing
special case check for RT (and DEADLINE) tasks.
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.
Grab cpuset_rwsem read lock from core scheduler, so to prevent
situations such as the one described above from happening.
The only exception is normalize_rt_tasks() which needs to work under
tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with
this, as this function is only called by sysrq and, if that gets
triggered, DEADLINE guarantees are already gone out of the window
anyway.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpuset_rwsem is going to be acquired from sched_setscheduler() with a
following patch. There are however paths (e.g., spawn_ksoftirqd) in
which sched_scheduler() is eventually called while holding hotplug lock;
this creates a dependecy between hotplug lock (to be always acquired
first) and cpuset_rwsem (to be always acquired after hotplug lock).
Fix paths which currently take the two locks in the wrong order (after
a following patch is applied).
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Holding cpuset_mutex means that cpusets are stable (only the holder can
make changes) and this is required for fixing a synchronization issue
between cpusets and scheduler core. However, grabbing cpuset_mutex from
setscheduler() hotpath (as implemented in a later patch) is a no-go, as
it would create a bottleneck for tasks concurrently calling
setscheduler().
Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
setscheduler() will then be able to read lock it and avoid concurrency
issues.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.
This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Calls to task_rq_unlock() are done several times in the
__sched_setscheduler() function. This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.
This patch streamlines the release of the rq lock so that only one
location need to be modified when dealing with more than one lock.
No change of functionality is introduced by this patch.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-3-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function. That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.
No change of functionality is introduced by this patch.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The same formula to check utilization against capacity (after
considering capacity_margin) is already used at 5 different locations.
This patch creates a new macro, fits_capacity(), which can be used from
all these locations without exposing the details of it and hence
simplify code.
All the 5 code locations are updated as well to use it..
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/b477ac75a2b163048bdaeb37f57b4c3f04f75a31.1559631700.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In real product setup, there will be houseeking CPUs in each nodes, it
is prefer to do housekeeping from local node, fallback to global online
cpumask if failed to find houseeking CPU from local node.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpengli@tencent.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sched_info_on() is called with unlikely hint, however, the test
is to be a constant(1) on which compiler will do nothing when
make defconfig, so remove the hint.
Also, fix a lack of {}.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: up2wing@gmail.com
Cc: wang.liang82@zte.com.cn
Cc: xue.zhihong@zte.com.cn
Link: https://lkml.kernel.org/r/1562301307-43002-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Returning the pointer that was passed in allows us to write
slightly more idiomatic code. Convert a few users.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reason.
A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criteria as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which requires to be woken up from an idle state).
This patch updates both fast and slow paths with this optimization.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/eeafa25fdeb6f6edd5b2da716bc8f0ba7708cbcf.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:
WARNING: inconsistent lock state
5.2.0-rc1+ #513 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0xb0/0x1c0
_raw_spin_lock_irqsave+0x3c/0x50
tick_broadcast_switch_to_oneshot+0xd/0x40
tick_switch_to_oneshot+0x4f/0xd0
hrtimer_run_queues+0xf3/0x130
run_local_timers+0x1c/0x50
update_process_times+0x1c/0x50
tick_periodic+0x26/0xc0
tick_handle_periodic+0x1a/0x60
smp_apic_timer_interrupt+0x80/0x2a0
apic_timer_interrupt+0xf/0x20
_raw_spin_unlock_irqrestore+0x4e/0x60
rcu_nocb_gp_kthread+0x15d/0x590
kthread+0xf3/0x130
ret_from_fork+0x3a/0x50
irq event stamp: 171
hardirqs last enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
softirqs last disabled at (0): [<0000000000000000>] 0x0
[...]
To reproduce, run the following rcutorture test:
$ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline(). This suggests
that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
and leaving them enabled on return.
Some debugging code showed that the culprit was sched_cpu_dying().
It had irqs enabled after return from sched_tick_stop(). Which in turn
had irqs enabled after return from cancel_delayed_work_sync(). Which is a
wrapper around __cancel_work_timer(). Which can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
idea when you are invoked from take_cpu_down(), regardless of the state
you leave interrupts in upon return.
Code inspection located no reason why the delayed work absolutely
needed to be canceled from sched_tick_stop(): The work is not
bound to the outgoing CPU by design, given that the whole point is
to collect statistics without disturbing the outgoing CPU.
This commit therefore simply drops the cancel_delayed_work_sync() from
sched_tick_stop(). Instead, a new ->state field is added to the tick_work
structure so that the delayed-work handler function sched_tick_remote()
can avoid reposting itself. A cpu_is_offline() check is also added to
sched_tick_remote() to avoid mucking with the state of an offlined CPU
(though it does appear safe to do so). The sched_tick_start() and
sched_tick_stop() functions also update ->state, and sched_tick_start()
also schedules the delayed work if ->state indicates that it is not
already in flight.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The load_balance() has a dedicated mecanism to detect when an imbalance
is due to CPU affinity and must be handled at parent level. In this case,
the imbalance field of the parent's sched_group is set.
The description of sg_imbalanced() gives a typical example of two groups
of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
group and 3 CPUs of the second group. Something like:
{ 0 1 2 3 } { 4 5 6 7 }
* * * *
But the load_balance fails to fix this UC on my octo cores system
made of 2 clusters of quad cores.
Whereas the load_balance is able to detect that the imbalanced is due to
CPU affinity, it fails to fix it because the imbalance field is cleared
before letting parent level a chance to run. In fact, when the imbalance is
detected, the load_balance reruns without the CPU with pinned tasks. But
there is no other running tasks in the situation described above and
everything looks balanced this time so the imbalance field is immediately
cleared.
The imbalance field should not be cleared if there is no other task to move
when the imbalance is detected.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We only need to set the callback_head worker function once, do it
during sched_fork().
While at it, move the comment regarding double task_work addition to
init_numa_balancing(), since the double add sentinel is first set there.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To reference task_numa_work() from within init_numa_balancing(), we
need the former to be declared before the latter. Do just that.
This is a pure code movement.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Re-evaluating the bitmap wheight of the online cpus bitmap in every
invocation of num_online_cpus() over and over is a pretty useless
exercise. Especially when num_online_cpus() is used in code paths
like the IPI delivery of x86 or the membarrier code.
Cache the number of online CPUs in the core and just return the cached
variable. The accessor function provides only a snapshot when used without
protection against concurrent CPU hotplug.
The storage needs to use an atomic_t because the kexec and reboot code
(ab)use set_cpu_online() in their 'shutdown' handlers without any form of
serialization as pointed out by Mathieu. Regular CPU hotplug usage is
properly serialized.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907091622590.1634@nanos.tec.linutronix.de
The booted once information which is required to deal with the MCE
broadcast issue on X86 correctly is stored in the per cpu hotplug state,
which is perfectly fine for the intended purpose.
X86 needs that information for supporting NMI broadcasting via shortcuts,
but retrieving it from per cpu data is cumbersome.
Move it to a cpumask so the information can be checked against the
cpu_present_mask quickly.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190722105219.818822855@linutronix.de
Report the number of stack traces and the number of stack trace hash
chains. These two numbers are useful because these allow to estimate
the number of stack trace hash collisions.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-5-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Although commit 669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys
for workqueues") unregisters dynamic lockdep keys when a workqueue is
destroyed, a side effect of that commit is that all stack traces
associated with the lockdep key are leaked when a workqueue is destroyed.
Fix this by storing each unique stack trace once. Other changes in this
patch are:
- Use NULL instead of { .nr_entries = 0 } to represent 'no trace'.
- Store a pointer to a stack trace in struct lock_class and struct
lock_list instead of storing 'nr_entries' and 'offset'.
This patch avoids that the following program triggers the "BUG:
MAX_STACK_TRACE_ENTRIES too low!" complaint:
#include <fcntl.h>
#include <unistd.h>
int main()
{
for (;;) {
int fd = open("/dev/infiniband/rdma_cm", O_RDWR);
close(fd);
}
}
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-4-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it clear to humans and to the compiler that the stack trace
('entries') arguments are not modified.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-3-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch does not change the behavior of the lockdep code.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-2-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some hardware PMU drivers will override perf_event.cpu inside their
event_init callback. This causes a lockdep splat when initialized through
the kernel API:
WARNING: CPU: 0 PID: 250 at kernel/events/core.c:2917 ctx_sched_out+0x78/0x208
pc : ctx_sched_out+0x78/0x208
Call trace:
ctx_sched_out+0x78/0x208
__perf_install_in_context+0x160/0x248
remote_function+0x58/0x68
generic_exec_single+0x100/0x180
smp_call_function_single+0x174/0x1b8
perf_install_in_context+0x178/0x188
perf_event_create_kernel_counter+0x118/0x160
Fix this by calling perf_install_in_context with event->cpu, just like
perf_event_open
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Frank Li <Frank.li@nxp.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/c4ebe0503623066896d7046def4d6b1e06e0eb2e.1563972056.git.leonard.crestez@nxp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An uninitialized/ zeroed mutex will go unnoticed because there is no
check for it. There is a magic check in the unlock's slowpath path which
might go unnoticed if the unlock happens in the fastpath.
Add a ->magic check early in the mutex_lock() and mutex_trylock() path.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190703092125.lsdf4gpsh2plhavb@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>