tick_do_timer_cpu is used lockless to check which CPU needs to take care
of the per tick timekeeping duty. This is done to avoid a thundering
herd problem on jiffies_lock.
The read and writes are not annotated so KCSAN complains about data races:
BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event
write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26:
tick_nohz_idle_stop_tick+0x3b1/0x4a0
do_idle+0x1e3/0x250
read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16:
tick_nohz_next_event+0xe7/0x1e0
tick_nohz_get_sleep_length+0xa7/0xe0
menu_select+0x82/0xb90
cpuidle_select+0x44/0x60
do_idle+0x1c2/0x250
value changed: 0x0000001a -> 0xffffffff
Annotate them with READ/WRITE_ONCE() to document the intentional data race.
Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sean Anderson <sean.anderson@seco.com>
Link: https://lore.kernel.org/r/87cyqy7rt3.ffs@tglx
In perf_adjust_freq_unthr_context(), it first starts the event and then
stop unnecessarily to adjust the sampling frequency if the event is
throttled.
For a throttled non-frequency event, it doesn't have a freq so no need
to adjust. Just starting the event would be ok.
For a frequency event, whether it's throttled or not, it needs to stop
before adjusting the frequency. That means it should not start the
even if it was throttled. I tried to skip calling the stop callback,
but it didn't work well since the event count might not be up to date.
It should call the stop callback with PERF_EF_UPDATE anyway.
However not calling start would prevent unnecessary MSR accesses (which
can be costly) for already stopped events as stop state is saved in the
hw config.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20240207050545.2727923-2-namhyung@kernel.org
It was unnecessarily disabling and enabling PMUs for each event. It
should be done at PMU level. Add pmu_ctx->nr_freq counter to check it
at each PMU. As PMU context has separate active lists for pinned group
and flexible group, factor out a new function to do the job.
Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
even if it needs to unthrottle sampling events.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20240207050545.2727923-1-namhyung@kernel.org
Support atomics in bpf_arena that can be JITed as a single x86 instruction.
Instructions that are JITed as loops are not supported at the moment,
since they require more complex extable and loop logic.
JITs can choose to do smarter things with bpf_jit_supports_insn().
Like arm64 may decide to support all bpf atomics instructions
when emit_lse_atomic is available and none in ll_sc mode.
bpf_jit_supports_percpu_insn(), bpf_jit_supports_ptr_xchg() and
other such callbacks can be replaced with bpf_jit_supports_insn()
in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240405231134.17274-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
If add_preferred_console() is not called early in setup_console(), we can
end up having register_console() call try_enable_default_console() before a
console device has called add_preferred_console().
Let's set console_set_on_cmdline flag in console_setup() to prevent this
from happening.
Signed-off-by: Tony Lindgren <tony@atomide.com>
Link: https://lore.kernel.org/r/20240327110021.59793-4-tony@atomide.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently console_setup() tries to make a console index out of any digits
passed in the kernel command line for console. In the DEVNAME:0.0 case,
the name can contain a device IO address, so bail out on console names
with a ':'.
Signed-off-by: Tony Lindgren <tony@atomide.com>
Link: https://lore.kernel.org/r/20240327110021.59793-3-tony@atomide.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Driver subsystems may need to translate the preferred console name to the
character device name used. We already do some of this in console_setup()
with a few hardcoded names, but that does not scale well.
The console options are parsed early in console_setup(), and the consoles
are added with __add_preferred_console(). At this point we don't know much
about the character device names and device drivers getting probed.
To allow driver subsystems to set up a preferred console, let's save the
kernel command line console options. To add a preferred console from a
driver subsystem with optional character device name translation, let's
add a new function add_preferred_console_match().
This allows the serial core layer to support console=DEVNAME:0.0 style
hardware based addressing in addition to the current console=ttyS0 style
naming. And we can start moving console_setup() character device parsing
to the driver subsystem specific code.
We use a separate array from the console_cmdline array as the character
device name and index may be unknown at the console_setup() time. And
eventually there's no need to call __add_preferred_console() until the
subsystem is ready to handle the console.
Adding the console name in addition to the character device name, and a
flag for an added console, could be added to the struct console_cmdline.
And the console_cmdline array handling could be modified accordingly. But
that complicates things compared saving the console options, and then
adding the consoles when the subsystems handling the consoles are ready.
Co-developed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240327110021.59793-2-tony@atomide.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does
"select TASKS_RCU if PREEMPTION". This works, but requires any change in
this enablement logic to be replicated across all such "select" clauses.
A new NEED_TASKS_RCU Kconfig option has been created to allow this
enablement logic to be in one place in kernel/rcu/Kconfig.
Therefore, make BPF select the new NEED_TASKS_RCU Kconfig option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: <bpf@vger.kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Currently, all waits for grace periods sleep at TASK_UNINTERRUPTIBLE,
regardless of RCU flavor. This has worked well, but there have been
cases where a longer-than-average Tasks RCU grace period has triggered
softlockup splats, many of them, before the Tasks RCU CPU stall warning
appears. These softlockup splats unnecessarily consume console bandwidth
and complicate diagnosis of the underlying problem. Plus a long but not
pathologically long Tasks RCU grace period might trigger a few softlockup
splats before completing normally, which generates noise for no good
reason.
This commit therefore causes Tasks RCU grace periods to sleep at TASK_IDLE
priority. If there really is a persistent problem, the eventual Tasks
RCU CPU stall warning will flag it, and without the extra noise.
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
It turns out that only one CPU at a time will ever invoke
rcu_torture_pipe_update_one() on a given rcu_torture structure.
This commit therefore adds three ASSERT_EXCLUSIVE_WRITER() calls
to enlist KCSAN's aid in checking this.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
If a callback flood prevents grace period from completing, rcutorture
does a WARN_ON(). Avoiding this WARN_ON() currently requires that at
least three grace periods elapse during an eight-second callback-flood
interval. Unfortunately, the current debug information does not include
anything about the grace-period state. This commit therefore adds a
call to cur_ops->gp_kthread_dbg(), if this function pointer is non-NULL.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
This commit adds the number of online CPUs to the state dump following
an unsuccesful callback-flood test.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
There is some indications that rcu_softirq_qs() might be more generally
used than anticipated. This commit therefore adds some lockdep assertions
and some cautionary tales in a new kernel-doc header.
Link: https://lore.kernel.org/all/Zd4DXTyCf17lcTfq@debian.debian/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Yan Zhai <yan@cloudflare.com>
Cc: <netdev@vger.kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Per filesystems/sysfs.rst, show() should only use sysfs_emit() or
sysfs_emit_at() when formatting the value to be returned to user space.
coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240314100402.1326582-2-lizhijian@fujitsu.com
Per filesystems/sysfs.rst, show() should only use sysfs_emit() or
sysfs_emit_at() when formatting the value to be returned to user space.
coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240314100402.1326582-1-lizhijian@fujitsu.com
Since commit 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside
get_online_cpus()"), cpuset hotplug was done asynchronously via a work
function. This is to avoid recursive locking of cgroup_mutex.
Since then, the cgroup locking scheme has changed quite a bit. A
cpuset_mutex was introduced to protect cpuset specific operations.
The cpuset_mutex is then replaced by a cpuset_rwsem. With commit
d74b27d63a ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock
order"), cpu_hotplug_lock is acquired before cpuset_rwsem. Later on,
cpuset_rwsem is reverted back to cpuset_mutex. All these locking changes
allow the hotplug code to call into cpuset core directly.
The following commits were also merged due to the asynchronous nature
of cpuset hotplug processing.
- commit b22afcdf04 ("cpu/hotplug: Cure the cpusets trainwreck")
- commit 50e7663233 ("sched/cpuset/pm: Fix cpuset vs. suspend-resume
bugs")
- commit 28b89b9e6f ("cpuset: handle race between CPU hotplug and
cpuset_hotplug_work")
Clean up all these bandages by making cpuset hotplug
processing synchronous again with the exception that the call to
cgroup_transfer_tasks() to transfer tasks out of an empty cgroup v1
cpuset, if necessary, will still be done via a work function due to the
existing cgroup_mutex -> cpu_hotplug_lock dependency. It is possible
to reverse that dependency, but that will require updating a number of
different cgroup controllers. This special hotplug code path should be
rarely taken anyway.
As all the cpuset states will be updated by the end of the hotplug
operation, we can revert most the above commits except commit
50e7663233 ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
which is partially reverted. Also removing some cpus_read_lock trylock
attempts in the cpuset partition code as they are no longer necessary
since the cpu_hotplug_lock is now held for the whole duration of the
cpuset hotplug code path.
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a function which allows to modify easily the EM after the new voltage
information is available. The device drivers for the chip can adjust
the voltage values after setup. The voltage for the same frequency in OPP
can be different due to chip binning. The voltage impacts the power usage
and the EM power values can be updated to reflect that.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Extract em_table_dup() and em_recalc_and_update() from
em_adjust_new_capacity(). Both functions will be later reused by the
'update EM due to chip binning' functionality.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
s2idle works like a regular suspend with freezing processes and freezing
devices. All CPUs except the control CPU go into idle. Once this is
completed the control CPU kicks all other CPUs out of idle, so that they
reenter the idle loop and then enter s2idle state. The control CPU then
issues an swait() on the suspend state and therefore enters the idle loop
as well.
Due to being kicked out of idle, the other CPUs leave their NOHZ states,
which means the tick is active and the corresponding hrtimer is programmed
to the next jiffie.
On entering s2idle the CPUs shut down their local clockevent device to
prevent wakeups. The last CPU which enters s2idle shuts down its local
clockevent and freezes timekeeping.
On resume, one of the CPUs receives the wakeup interrupt, unfreezes
timekeeping and its local clockevent and starts the resume process. At that
point all other CPUs are still in s2idle with their clockevents switched
off. They only resume when they are kicked by another CPU or after resuming
devices and then receiving a device interrupt.
That means there is no guarantee that all CPUs will wakeup directly on
resume. As a consequence there is no guarantee that timers which are queued
on those CPUs and should expire directly after resume, are handled. Also
timer list timers which are remotely queued to one of those CPUs after
resume will not result in a reprogramming IPI as the tick is
active. Queueing a hrtimer will also not result in a reprogramming IPI
because the first hrtimer event is already in the past.
The recent introduction of the timer pull model (7ee9887703 ("timers:
Implement the hierarchical pull model")) amplifies this problem, if the
current migrator is one of the non woken up CPUs. When a non pinned timer
list timer is queued and the queuing CPU goes idle, it relies on the still
suspended migrator CPU to expire the timer which will happen by chance.
The problem exists since commit 8d89835b04 ("PM: suspend: Do not pause
cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
in turn invoked a wakeup for all idle CPUs was moved to a later point in
the resume process. This might not be reached or reached very late because
it waits on a timer of a still suspended CPU.
Address this by kicking all CPUs out of idle after the control CPU returns
from swait() so that they resume their timers and restore consistent system
state.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
Fixes: 8d89835b04 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: 5.16+ <stable@kernel.org> # 5.16+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Kernel timekeeping is designed to keep the change in cycles (since the last
timer interrupt) below max_cycles, which prevents multiplication overflow
when converting cycles to nanoseconds. However, if timer interrupts stop,
the clocksource_cyc2ns() calculation will eventually overflow.
Add protection against that. Simplify by folding together
clocksource_delta() and clocksource_cyc2ns() into cycles_to_nsec_safe().
Check against max_cycles, falling back to a slower higher precision
calculation.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-20-adrian.hunter@intel.com
For the case !CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE, forego overflow
protection in the range (mask << 1) < delta <= mask, and interpret it
always as an inconsistency between CPU clock values. That allows
slightly neater code, and it is on a slow path so has no effect on
performance.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-19-adrian.hunter@intel.com
Kernel timekeeping is designed to keep the change in cycles (since the last
timer interrupt) below max_cycles, which prevents multiplication overflow
when converting cycles to nanoseconds. However, if timer interrupts stop,
the calculation will eventually overflow.
Add protection against that. In timekeeping_cycles_to_ns() calculation,
check against max_cycles, falling back to a slower higher precision
calculation. In timekeeping_forward_now(), process delta in chunks of at
most max_cycles.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-18-adrian.hunter@intel.com
Open code clocksource_delta() in timekeeping_cycles_to_ns() so that
overflow safety can be added efficiently.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-17-adrian.hunter@intel.com
timekeeping_delta_to_ns() is now called only from
timekeeping_cycles_to_ns(), and it is not useful otherwise.
Simplify the code by folding it into timekeeping_cycles_to_ns().
No functional change.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-16-adrian.hunter@intel.com
Consolidate timekeeping helpers, making use of timekeeping_cycles_to_ns()
in preference to directly using timekeeping_delta_to_ns().
No functional change.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-15-adrian.hunter@intel.com
Simplify the usage of timekeeping sanity checking, in preparation for
consolidating timekeeping helpers. This works towards eliminating
timekeeping_delta_to_ns() in favour of timekeeping_cycles_to_ns().
No functional change.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-14-adrian.hunter@intel.com
Simplify __timekeeping_get_ns() by reusing timekeeping_cycles_to_ns().
No functional change.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-13-adrian.hunter@intel.com
Put together declaration and initialization of the local variable 'delta'.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-12-adrian.hunter@intel.com
Rename fast_tk_get_delta_ns() to __timekeeping_get_ns() to prepare for its
reuse as a general timekeeping helper function.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-11-adrian.hunter@intel.com
Move timekeeping helper functions to prepare for their reuse.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-10-adrian.hunter@intel.com
Add vdso_data::max_cycles in preparation to use it to detect potential
multiplication overflow.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240325064023.2997-7-adrian.hunter@intel.com
It's a bit hard to read the logic since the virq is used before checking it
for 0. Rearrange the code to make it better to understand.
This, in particular, should clearly answer the question whether the caller
needs to perform this check or not, and there are plenty of places for both
variants, confirming a confusion.
Fun fact that the new code is shorter:
Function old new delta
irq_dispose_mapping 278 271 -7
Total: Before=11625, After=11618, chg -0.06%
when compiled by GCC on Debian for x86_64.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240405190105.3932034-1-andriy.shevchenko@linux.intel.com
- Fix a timer migration bug that may result in missed events
- Fix timer migration group hierarchy event updates
- Fix a PowerPC64 build warning
- Fix a handful of DocBook annotation bugs
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSUpsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h+/RAAlbYzlotBMM0cqxCng5jgetTT7EfQHXl1
zaqhx2FzEjoyhZ++kpBP03A42LumWz0TXTqRK+BicZIHWvIWz16w7xNr0dHo3+L8
PfPTZEPb1IwSP1FKHyzEZbVWPnHtokyJBky5Qp5IG5FoNqV1pArqeadyaSbd3hIw
A3l77wHCtXINkxjROs5EoJiOwVcJWigm4M7189EXDUKKr5nzE0hemNAKGnluQZxj
O5gF9vv40B38MLuo3xLDxFCrY8WDcq9yhv/AtBk+952FsceSZbH29zOt1a5l2HPb
yvBR4pMaS6x4UdzJeZTbdqDs8v9QWsCUc+qqeNYuFEJSBu9y7Qo5wec8c+Ptiu0E
1we/g4nWRaRnXvGyS1uj448jUZgnGu61KFbCCF+guDl94zKY6TBZfVpeWrF/Xjdr
Jq1K8zYMM/+hxlzqsVhoaL+2zAddUeWnwPcSC5J8mnVlyLJUd55Cd0OGcHimz3PV
QcimajOcE7e/pkw0eQnRQ6qAVeWXcJY4hWoJS9Nk8F9InfDC7I8T5NgsNVb6Edyx
fj2wE/K9lAfKevz49ieJ8ItIIus3Lzmi09pbfDmDP5J9iMyL6UMk2VXj8XAUvCdL
qpgigP1zcluwAFqHmaym6mUsej+VL/WqsKfy6Q8LI5yNvdYtUuzfQuqGqyOyGXX0
zJg6+qU7OAE=
=4VkW
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Fix various timer bugs:
- Fix a timer migration bug that may result in missed events
- Fix timer migration group hierarchy event updates
- Fix a PowerPC64 build warning
- Fix a handful of DocBook annotation bugs"
* tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Return early on deactivation
timers/migration: Fix ignored event due to missing CPU update
vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h
timers: Fix text inconsistencies and spelling
tick/sched: Fix struct tick_sched doc warnings
tick/sched: Fix various kernel-doc warnings
timers: Fix kernel-doc format and add Return values
time/timekeeping: Fix kernel-doc warnings and typos
time/timecounter: Fix inline documentation
Currently, a set of core BPF kfuncs (e.g. bpf_task_*, bpf_cgroup_*,
bpf_cpumask_*, etc) cannot be invoked from BPF_PROG_TYPE_SYSCALL
programs. The whitelist approach taken for enabling kfuncs makes sense:
it not safe to call these kfuncs from every program type. For example,
it may not be safe to call bpf_task_acquire() in an fentry to
free_task().
BPF_PROG_TYPE_SYSCALL, on the other hand, is a perfectly safe program
type from which to invoke these kfuncs, as it's a very controlled
environment, and we should never be able to run into any of the typical
problems such as recursive invoations, acquiring references on freeing
kptrs, etc. Being able to invoke these kfuncs would be useful, as
BPF_PROG_TYPE_SYSCALL can be invoked with BPF_PROG_RUN, and would
therefore enable user space programs to synchronously call into BPF to
manipulate these kptrs.
This patch therefore enables invoking the aforementioned core kfuncs
from BPF_PROG_TYPE_SYSCALL progs.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240405143041.632519-2-void@manifault.com
Taking different maps within a single bpf_for_each_map_elem call is not
allowed before, because from the second map,
bpf_insn_aux_data->map_ptr_state will be marked as *poison*. In fact
both map_ptr and state are needed to support this use case: map_ptr is
used by set_map_elem_callback_state() while poison state is needed to
determine whether to use direct call.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-3-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, bpf_insn_aux_data->map_ptr_state is used to store either
map_ptr or its poison state (i.e., BPF_MAP_PTR_POISON). Thus
BPF_MAP_PTR_POISON must be checked before reading map_ptr. In certain
cases, we may need valid map_ptr even in case of poison state.
This will be explained in next patch with bpf_for_each_map_elem()
helper.
This patch changes map_ptr_state into a new struct including both map
pointer and its state (poison/unpriv). It's in the same union with
struct bpf_loop_inline_state, so there is no extra memory overhead.
Besides, macros BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR are no
longer needed.
This patch does not change any existing functionality.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-2-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The newly added code to handle bpf_get_branch_snapshot fails to link when
CONFIG_PERF_EVENTS is disabled:
aarch64-linux-ld: kernel/bpf/verifier.o: in function `do_misc_fixups':
verifier.c:(.text+0x1090c): undefined reference to `__SCK__perf_snapshot_branch_stack'
Add a build-time check for that Kconfig symbol around the code to
remove the link time dependency.
Fixes: 314a53623c ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240405142637.577046-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 4b6f4c5a67 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.
But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = 0 migrator = NONE
active = 0 active = NONE
nextevt = T0i, T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
active idle idle idle
0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
nextevt = T0:0i, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :
lock(GRP0:0->lock);
timerqueue_del(GRP0:0, T0i);
unlock(GRP0:0->lock);
[GRP1:0]
migrator = NONE
active = NONE
nextevt = T0:0, T0:1
/ \
[GRP0:0] [GRP0:1]
migrator = NONE migrator = NONE
active = NONE active = NONE
nextevt = T1 nextevt = T2
/ \ / \
0 (T0i) 1 (T1) 2 (T2) 3
idle idle idle idle
2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):
lock(GRP0:0->lock);
lock(GRP1:0->lock);
evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
in GRP0:0
... update GRP1:0 group event and timerqueue ...
unlock(GRP1:0->lock);
unlock(GRP0:0->lock);
So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.
Revert commit 4b6f4c5a67 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.
Fixes: 4b6f4c5a67 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus
When a group event is updated with its expiry unchanged but a different
CPU, that target change may go unnoticed and the event may be propagated
up with a stale CPU value. The following depicts a scenario that has
been actually observed:
[GRP2:0]
migrator = GRP1:1
active = GRP1:1
nextevt = TGRP1:0 (T0)
/ \
[GRP1:0] [GRP1:1]
migrator = NONE [...]
active = NONE
nextevt = TGRP0:0 (T0)
/ \
[GRP0:0] [...]
migrator = NONE
active = NONE
nextevt = T0
/ \
0 (T0) 1 (T1)
idle idle
0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle,
including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have
the same expiry value.
[GRP2:0]
migrator = GRP1:1
active = GRP1:1
nextevt = KTIME_MAX
/ \
[GRP1:0] [GRP1:1]
migrator = NONE [...]
active = NONE
nextevt = TGRP0:0 (T0)
/ \
[GRP0:0] [...]
migrator = NONE
active = NONE
nextevt = T0
/ \
0 (T0) 1 (T1)
idle idle
1) The migrator in GRP1:1 handles remotely T0. The event is dequeued
from the top and T0 executed.
[GRP2:0]
migrator = GRP1:1
active = GRP1:1
nextevt = KTIME_MAX
/ \
[GRP1:0] [GRP1:1]
migrator = NONE [...]
active = NONE
nextevt = TGRP0:0 (T0)
/ \
[GRP0:0] [...]
migrator = NONE
active = NONE
nextevt = T1
/ \
0 1 (T1)
idle idle
2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds
none. But it updates the events from its groups, starting with GRP0:0
which now has T1 as its next event. So far so good.
[GRP2:0]
migrator = GRP1:1
active = GRP1:1
nextevt = KTIME_MAX
/ \
[GRP1:0] [GRP1:1]
migrator = NONE [...]
active = NONE
nextevt = TGRP0:0 (T0)
/ \
[GRP0:0] [...]
migrator = NONE
active = NONE
nextevt = T1
/ \
0 1 (T1)
idle idle
3) The migrator in GRP1:1 proceeds upward and updates the events in
GRP1:0. The child event TGRP0:0 is found queued with the same expiry
as before. And therefore it is left unchanged. However the target CPU
is not the same but that fact is ignored so TGRP0:0 still points to
CPU 0 when it should point to CPU 1.
[GRP2:0]
migrator = GRP1:1
active = GRP1:1
nextevt = TGRP1:0 (T0)
/ \
[GRP1:0] [GRP1:1]
migrator = NONE [...]
active = NONE
nextevt = TGRP0:0 (T0)
/ \
[GRP0:0] [...]
migrator = NONE
active = NONE
nextevt = T1
/ \
0 1 (T1)
idle idle
4) The propagation has reached the top level and TGRP1:0, having TGRP0:0
as its first event, also wrongly points to CPU 0. TGRP1:0 is added to
the top level group.
[GRP2:0]
migrator = GRP1:1
active = GRP1:1
nextevt = KTIME_MAX
/ \
[GRP1:0] [GRP1:1]
migrator = NONE [...]
active = NONE
nextevt = TGRP0:0 (T0)
/ \
[GRP0:0] [...]
migrator = NONE
active = NONE
nextevt = T1
/ \
0 1 (T1)
idle idle
5) The migrator in GRP1:1 dequeues the next event in top level pointing
to CPU 0. But since it actually doesn't see any real event in CPU 0, it
early returns.
6) T1 is left unhandled until either CPU 0 or CPU 1 wake up.
Some other bad scenario may involve trees with just two levels.
Fix this with unconditionally updating the CPU of the child event before
considering to early return while updating a queued event with an
unchanged expiry value.
Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/Zg2Ct6M2RJAYHgCB@localhost.localdomain
r10 is a special register that is not under BPF program's control and is
always effectively precise. The rest of precision logic assumes that
only r0-r9 SCALAR registers are marked as precise, so prevent r10 from
being marked precise.
This can happen due to signed cast instruction allowing to do something
like `r0 = (s8)r10;`, which later, if r0 needs to be precise, would lead
to an attempt to mark r10 as precise.
Prevent this with an extra check during instruction backtracking.
Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Reported-by: syzbot+148110ee7cf72f39f33e@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404214536.3551295-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fairly usual collection of driver and core fixes. The large selftest
accompanying one of the fixes is also becoming a common occurrence.
Current release - regressions:
- ipv6: fix infinite recursion in fib6_dump_done()
- net/rds: fix possible null-deref in newly added error path
Current release - new code bugs:
- net: do not consume a full cacheline for system_page_pool
- bpf: fix bpf_arena-related file descriptor leaks in the verifier
- drv: ice: fix freeing uninitialized pointers, fixing misuse of
the newfangled __free() auto-cleanup
Previous releases - regressions:
- x86/bpf: fixes the BPF JIT with retbleed=stuff
- xen-netfront: add missing skb_mark_for_recycle, fix page pool
accounting leaks, revealed by recently added explicit warning
- tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
non-wildcard addresses
- Bluetooth:
- replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
with better workarounds to un-break some buggy Qualcomm devices
- set conn encrypted before conn establishes, fix re-connecting
to some headsets which use slightly unusual sequence of msgs
- mptcp:
- prevent BPF accessing lowat from a subflow socket
- don't account accept() of non-MPC client as fallback to TCP
- drv: mana: fix Rx DMA datasize and skb_over_panic
- drv: i40e: fix VF MAC filter removal
Previous releases - always broken:
- gro: various fixes related to UDP tunnels - netns crossing problems,
incorrect checksum conversions, and incorrect packet transformations
which may lead to panics
- bpf: support deferring bpf_link dealloc to after RCU grace period
- nf_tables:
- release batch on table validation from abort path
- release mutex after nft_gc_seq_end from abort path
- flush pending destroy work before exit_net release
- drv: r8169: skip DASH fw status checks when DASH is disabled
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
=6bQI
-----END PGP SIGNATURE-----
Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
accompanying one of the fixes is also becoming a common occurrence.
Current release - regressions:
- ipv6: fix infinite recursion in fib6_dump_done()
- net/rds: fix possible null-deref in newly added error path
Current release - new code bugs:
- net: do not consume a full cacheline for system_page_pool
- bpf: fix bpf_arena-related file descriptor leaks in the verifier
- drv: ice: fix freeing uninitialized pointers, fixing misuse of the
newfangled __free() auto-cleanup
Previous releases - regressions:
- x86/bpf: fixes the BPF JIT with retbleed=stuff
- xen-netfront: add missing skb_mark_for_recycle, fix page pool
accounting leaks, revealed by recently added explicit warning
- tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
non-wildcard addresses
- Bluetooth:
- replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
better workarounds to un-break some buggy Qualcomm devices
- set conn encrypted before conn establishes, fix re-connecting to
some headsets which use slightly unusual sequence of msgs
- mptcp:
- prevent BPF accessing lowat from a subflow socket
- don't account accept() of non-MPC client as fallback to TCP
- drv: mana: fix Rx DMA datasize and skb_over_panic
- drv: i40e: fix VF MAC filter removal
Previous releases - always broken:
- gro: various fixes related to UDP tunnels - netns crossing
problems, incorrect checksum conversions, and incorrect packet
transformations which may lead to panics
- bpf: support deferring bpf_link dealloc to after RCU grace period
- nf_tables:
- release batch on table validation from abort path
- release mutex after nft_gc_seq_end from abort path
- flush pending destroy work before exit_net release
- drv: r8169: skip DASH fw status checks when DASH is disabled"
* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
netfilter: validate user input for expected length
net/sched: act_skbmod: prevent kernel-infoleak
net: usb: ax88179_178a: avoid the interface always configured as random address
net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
net: ravb: Always update error counters
net: ravb: Always process TX descriptor ring
netfilter: nf_tables: discard table flag update with pending basechain deletion
netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
netfilter: nf_tables: reject new basechain after table flag update
netfilter: nf_tables: flush pending destroy work before exit_net release
netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
netfilter: nf_tables: release batch on table validation from abort path
Revert "tg3: Remove residual error handling in tg3_suspend"
tg3: Remove residual error handling in tg3_suspend
net: mana: Fix Rx DMA datasize and skb_over_panic
net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
net: stmmac: fix rx queue priority assignment
net: txgbe: fix i2c dev name cannot match clkdev
net: fec: Set mac_managed_pm during probe
...
Inline bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into underlying callback of
perf_snapshot_branch_stack static call. This callback is set early
during kernel initialization and is never updated or reset, so it's ok
to fetch actual implementation using static_call_query() and call
directly into it.
This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
there is no reason for BPF subsystem to keep track of which
architectures do support LBR or not. E.g., it looks like ARM64 might soon
get support for BRBE ([0]), which (with proper integration) should be
possible to utilize using this BPF helper.
perf_snapshot_branch_stack static call will point to
__static_call_return0() by default, which just returns zero, which will
lead to -ENOENT, as expected. So no need to guard anything here.
[0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In preparation to passing an object pointer to fsnotify_find_mark(), add
a wrapper fsnotify_find_inode_mark() and use it where possible.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20240317184154.1200192-4-amir73il@gmail.com>
Using new per-CPU BPF instruction, partially inline
bpf_map_lookup_elem() helper for per-CPU hashmap BPF map. Just like for
normal HASH map, we still generate a call into __htab_map_lookup_elem(),
but after that we resolve per-CPU element address using a new
instruction, saving on extra functions calls.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>