1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

44789 commits

Author SHA1 Message Date
Qais Yousef
58eeb2d79b sched/fair: Don't double balance_interval for migrate_misfit
It is not necessarily an indication of the system being busy and
requires a backoff of the load balancer activities. But pushing it high
could mean generally delaying other misfit activities or other type of
imbalances.

Also don't pollute nr_balance_failed because of misfit failures. The
value is used for enabling cache hot migration and in migrate_util/load
types. None of which should be impacted (skewed) by misfit failures.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240324004552.999936-5-qyousef@layalina.io
2024-03-25 12:09:57 +01:00
Qais Yousef
fa427e8e53 sched/topology: Remove root_domain::max_cpu_capacity
The value is no longer used as we now keep track of max_allowed_capacity
for each task instead.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io
2024-03-25 12:09:56 +01:00
Qais Yousef
22d5607400 sched/fair: Check if a task has a fitting CPU when updating misfit
If a misfit task is affined to a subset of the possible CPUs, we need to
verify that one of these CPUs can fit it. Otherwise the load balancer
code will continuously trigger needlessly leading the balance_interval
to increase in return and eventually end up with a situation where real
imbalances take a long time to address because of this impossible
imbalance situation.

This can happen in Android world where it's common for background tasks
to be restricted to little cores.

Similarly if we can't fit the biggest core, triggering misfit is
pointless as it is the best we can ever get on this system.

To be able to detect that; we use asym_cap_list to iterate through
capacities in the system to see if the task is able to run at a higher
capacity level based on its p->cpus_ptr. We do that when the affinity
change, a fair task is forked, or when a task switched to fair policy.
We store the max_allowed_capacity in task_struct to allow for cheap
comparison in the fast path.

Improve check_misfit_status() function by removing redundant checks.
misfit_task_load will be 0 if the task can't move to a bigger CPU. And
nohz_balancer_kick() already checks for cpu_check_capacity() before
calling check_misfit_status().

Test:
=====

Add

	trace_printk("balance_interval = %lu\n", interval)

in get_sd_balance_interval().

run
	if [ "$MASK" != "0" ]; then
		adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
	fi
	sleep 10
	// parse ftrace buffer counting the occurrence of each valaue

Where MASK is either:

	* 0: no busy task running
	* 1: busy task is pinned to 1 cpu; handled today to not cause
	  misfit
	* f: busy task pinned to little cores, simulates busy background
	  task, demonstrates the problem to be fixed

Results:
========

Note how occurrence of balance_interval = 128 overshoots for MASK = f.

BEFORE
------

	MASK=0

		   1 balance_interval = 175
		 120 balance_interval = 128
		 846 balance_interval = 64
		  55 balance_interval = 63
		 215 balance_interval = 32
		   2 balance_interval = 31
		   2 balance_interval = 16
		   4 balance_interval = 8
		1870 balance_interval = 4
		  65 balance_interval = 2

	MASK=1

		  27 balance_interval = 175
		  37 balance_interval = 127
		 840 balance_interval = 64
		 167 balance_interval = 63
		 449 balance_interval = 32
		  84 balance_interval = 31
		 304 balance_interval = 16
		1156 balance_interval = 8
		2781 balance_interval = 4
		 428 balance_interval = 2

	MASK=f

		   1 balance_interval = 175
		1328 balance_interval = 128
		  44 balance_interval = 64
		 101 balance_interval = 63
		  25 balance_interval = 32
		   5 balance_interval = 31
		  23 balance_interval = 16
		  23 balance_interval = 8
		4306 balance_interval = 4
		 177 balance_interval = 2

AFTER
-----

Note how the high values almost disappear for all MASK values. The
system has background tasks that could trigger the problem without
simulate it even with MASK=0.

	MASK=0

		 103 balance_interval = 63
		  19 balance_interval = 31
		 194 balance_interval = 8
		4827 balance_interval = 4
		 179 balance_interval = 2

	MASK=1

		 131 balance_interval = 63
		   1 balance_interval = 31
		  87 balance_interval = 8
		3600 balance_interval = 4
		   7 balance_interval = 2

	MASK=f

		   8 balance_interval = 127
		 182 balance_interval = 63
		   3 balance_interval = 31
		   9 balance_interval = 16
		 415 balance_interval = 8
		3415 balance_interval = 4
		  21 balance_interval = 2

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240324004552.999936-3-qyousef@layalina.io
2024-03-25 12:09:54 +01:00
Qais Yousef
77222b0d12 sched/topology: Export asym_cap_list
So that we can use it to iterate through available capacities in the
system. Sort asym_cap_list in descending order as expected users are
likely to be interested on the highest capacity first.

Make the list RCU protected to allow for cheap access in hot paths.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io
2024-03-25 12:09:53 +01:00
Ingo Molnar
f4566a1e73 Linux 6.9-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYAlq0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYqwH/0fb4pRbVtULpiIK
 Cs7/e/IWzRRWLBq+Jj2KVVTxwjyiKFNOq6K/CHHnljIWo1yN2CIWeOgbHfTI0WfN
 xmBdJP7OtK8MCN9PwwoWhZxMLcyv4pFCERrrkGa7AD+cdN4j/ytQ3mH5V8f/21fd
 rnpQSdpgGXB2SSMHd520Y+e56+gxrrTmsDXjZWM08Wt0bbqAWJrjNe58BMz5hI1t
 yQtcgYRTdUuZBn5TMkT99lK9EFQslV38YCo7RUP5D0DWXS1jSfWlgnCD1Nc1ziF4
 ps/xPdUMDJAc5Tslg/hgJOciSuLqgMzIUsVgZrKysuu3NhwDY1LDWGORmH1t8E8W
 RC25950=
 =F+01
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc1' into sched/core, to pick up fixes and to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-25 11:32:29 +01:00
Masami Hiramatsu (Google)
0add699ad0 tracing: probes: Fix to zero initialize a local variable
Fix to initialize 'val' local variable with zero.
Dan reported that Smatch static code checker reports an error that a local
'val' variable needs to be initialized. Actually, the 'val' is expected to
be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
initialize it with zero.

Link: https://lore.kernel.org/all/171092223833.237219.17304490075697026697.stgit@devnote2/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
Fixes: 25f00e40ce ("tracing/probes: Support $argN in return probe (kprobe and fprobe)")
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-25 16:24:31 +09:00
Linus Torvalds
864ad046c1 dma-mapping fixes for Linux 6.9
This has a set of swiotlb alignment fixes for sometimes very long
 standing bugs from Will.  We've been discussion them for a while and they
 should be solid now.
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmX/bmILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOuKQ//cUR3EywszAc04x8dIYsfegFGdQxUeJD0+1elAPss
 ELiqrlg5A/Yn4uHKpXjWbvJ+v1Ywh3o8+vlgUiG4aFeg4xEd+FsJqm2SDa3jhdMP
 2hV8pwB92kpkKCxyCAqx8O/4o4fY++KCFsOtnammEudFjurJaCrRTlauOn6D1t/i
 JsBYCFtjFIhIPHQe7jmZ6dNiLEfiIJ+q8ImW+UxuB+gOGgU8C4VVW3tHuo3KeU7n
 yVOcz4yJrQ4xYzG3RKtaU0FE0ybA860xwiA5oPvqpI9A2ISGovv7ik0QCUlHXhff
 z+iL8Lj/KsOucq5pBDhbRYeN2n4VVogEwb/hut6mgyqj1ESjqeZaLioVHqOTDbmB
 +vNTVBt6OGTOq1YkNKttK9vBBXs5RdZSBalzBG/QO1ewmrNVVZ7z8fWXVRDipoIl
 sAIXmI8xAy5TNL6UbJ+RDfYeLlTzHjXGKQGB49gumOA8s4w5P5v9diYegX6GcVZV
 PKkYLOvprwcyi8Xxx2mNxFDxh+LWqzMYqzwsN7AoRTW4TRc7Tel0G6Axs+V/cL/Y
 23IHfFfT2HqDUM5PuBfUcgCrtw1hinuD80xqXVcvaU+AYoQhrGHJFLHkj6lTwV2b
 hmuul170froI2A/vm8yGGqcn2Me55AexlpMab+UWL+iisGtqFTWi9b9vK/2Vi+Zj
 wBg=
 =Xaob
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "This has a set of swiotlb alignment fixes for sometimes very long
  standing bugs from Will. We've been discussion them for a while and
  they should be solid now"

* tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
  iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
  swiotlb: Fix alignment checks when both allocation and DMA masks are present
  swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
  swiotlb: Enforce page alignment in swiotlb_alloc()
  swiotlb: Fix double-allocation of slots due to broken alignment handling
2024-03-24 10:45:31 -07:00
Linus Torvalds
70293240c5 Two regression fixes for the timer and timer migration code:
1) Prevent endless timer requeuing which is caused by two CPUs racing out
      of idle. This happens when the last CPU goes idle and therefore has to
      ensure to expire the pending global timers and some other CPU come out
      of idle at the same time and the other CPU wins the race and expires
      the global queue. This causes the last CPU to chase ghost timers
      forever and reprogramming it's clockevent device endlessly.
 
      Cure this by re-evaluating the wakeup time unconditionally.
 
   2) The split into local (pinned) and global timers in the timer wheel
      caused a regression for NOHZ full as it broke the idle tracking of
      global timers. On NOHZ full this prevents an self IPI being sent which
      in turn causes the timer to be not programmed and not being expired on
      time.
 
      Restore the idle tracking for the global timer base so that the self
      IPI condition for NOHZ full is working correctly again.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmX/Mn8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYX0D/93fKa9qp+qBs72Vvctj4bJuk6auel7
 GRWx3Vc//kk4VOJFCteQ2eykr1/fsLuibPb67iiKp41stvaWKeJPj0kD+RUuVf8E
 dPGPpPU+qY5ynhoiqJekZL7+5NSA48y4bDc00a4U31MPpEcJB5y94zFOCKMiCWtk
 6Tf6I6168bsSFYvKqb2LImVoowu/bf7bXLVUk1HcdNnSC7bfx+yN8nkQ1zSy3K2M
 IyE1CQqMyDmdfKW9Vs68ooTIpuA0n7bxOuXbVaFdJyiJ035v3Z3+m2vQmrHHLDdz
 MfqHbFDEomDC+zfiugFvuxyxLIi2Gf/NXPibu6OxLkVe2Pu1KUJkhFbgZVUR2W6A
 EU6SmZr77zMPAMZyqG8OJqTqlCPiJfJX2KMWDF+ezbXBt+sbMe6LfazPqj9TopnN
 /ECMCt77xl1POCEnP81hPWizKsqf8HCTDDZEi9UlqbIxT3TgrZFlTnKfmmciwWiP
 uGoUXgZmi8qJ+lSGNTVUgbTmIbazvUz43sKgfndUy2yxeCb/SBlx1/8Ys2ntszOy
 XDOF8QroPH0zXlBaYo0QVZbOdB4O0/En1qZuGScBmoUY7bRr0NRD/C3ObhvQHI7C
 iguAsnB+zirwwZSTDzwhQDhXAtWgSaqBB7pb4aCDxC0AvnKtL5HKdDNeIafpPJeA
 4Xh40iu44u/V9w==
 =lY5O
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two regression fixes for the timer and timer migration code:

   - Prevent endless timer requeuing which is caused by two CPUs racing
     out of idle. This happens when the last CPU goes idle and therefore
     has to ensure to expire the pending global timers and some other
     CPU come out of idle at the same time and the other CPU wins the
     race and expires the global queue. This causes the last CPU to
     chase ghost timers forever and reprogramming it's clockevent device
     endlessly.

     Cure this by re-evaluating the wakeup time unconditionally.

   - The split into local (pinned) and global timers in the timer wheel
     caused a regression for NOHZ full as it broke the idle tracking of
     global timers. On NOHZ full this prevents an self IPI being sent
     which in turn causes the timer to be not programmed and not being
     expired on time.

     Restore the idle tracking for the global timer base so that the
     self IPI condition for NOHZ full is working correctly again"

* tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix removed self-IPI on global timer's enqueue in nohz_full
  timers/migration: Fix endless timer requeue after idle interrupts
2024-03-23 14:49:25 -07:00
Linus Torvalds
976b029d06 A single fix for the generic entry code:
THe trace_sys_enter() tracepoint can modify the syscall number via
   kprobes or BPF in pt_regs, but that requires that the syscall number is
   re-evaluted from pt_regs after the tracepoint.
 
   A seccomp fix in that area removed the re-evaluation so the change does
   not take effect as the code just uses the locally cached number.
 
   Restore the original behaviour by re-evaluating the syscall number after
   the tracepoint.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmX/LJoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodkJEACFbJqxdlpoWvc8lpYd0htqP+EyN6li
 quMk+acE8IOtiqOetphOrmKGzBYdtpNu+7uWVeRTyCarm2Gb5lX1GnIMwyEK8R25
 R+fVbPaECldSUKypAbTze2RpJl/vp2jeKhCpOn3A5MpITd0vV9SEs+9hecBz3ucZ
 CXB4vPR7n8VxHX5ArbHqe05fF5TnSZ4CnpeWONQghGsEaSJ87XrljVfC/fxmb7mZ
 yB2KAIKeYqslvoFoIWSAJk6xvwcGC08vY8mxkM6n61yXmAZ33BWvpnafGHsstF/V
 gT9Kmvs7F6Tn6V5j0SUuC4/nUO/dtEDhvra793OSvR/rB3+AmZfECPQaDNIJqNaG
 0pG558E+hrnpV/YWGzRYJCtqUgtTS0kV5JJY9ffoJRQB3Nb7GCeyykDcs823Yt7j
 A5o3JUQU0kW4X08hGFMU6DSFppxp9nS0eXDoKMlVRmDB8l30bLK+tMPMaID6yGse
 0UafVMWRhtM85mqNVZIZKTsRuONvT+8PV2UH5cNryGmjjbykLmjSLFRvXBmk2Jl+
 uhcc0QaApQleA/H/fSNoyPFOPCqvDBEo4xM9e3ypc8TflOQegZfpzrjAITHzv9hY
 r4ci8faKXTR7jY70a6TiEgcGYYLYDmyvOL0I+TwYA9qHB7tEaJMF6YfU8UDsM9K2
 Gb4l6n8fNQubTg==
 =iW0L
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry fix from Thomas Gleixner:
 "A single fix for the generic entry code:

  The trace_sys_enter() tracepoint can modify the syscall number via
  kprobes or BPF in pt_regs, but that requires that the syscall number
  is re-evaluted from pt_regs after the tracepoint.

  A seccomp fix in that area removed the re-evaluation so the change
  does not take effect as the code just uses the locally cached number.

  Restore the original behaviour by re-evaluating the syscall number
  after the tracepoint"

* tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Respect changes to system call number by trace_sys_enter()
2024-03-23 14:17:37 -07:00
Puranjay Mohan
122fdbd2a0 bpf: verifier: reject addr_space_cast insn without arena
The verifier allows using the addr_space_cast instruction in a program
that doesn't have an associated arena. This was caught in the form an
invalid memory access in do_misc_fixups() when while converting
addr_space_cast to a normal 32-bit mov, env->prog->aux->arena was
dereferenced to check for BPF_F_NO_USER_CONV flag.

Reject programs that include the addr_space_cast instruction but don't
have an associated arena.

root@rv-tester:~# ./reproducer
 Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000000030
 Oops [#1]
 [<ffffffff8017eeaa>] do_misc_fixups+0x43c/0x1168
 [<ffffffff801936d6>] bpf_check+0xda8/0x22b6
 [<ffffffff80174b32>] bpf_prog_load+0x486/0x8dc
 [<ffffffff80176566>] __sys_bpf+0xbd8/0x214e
 [<ffffffff80177d14>] __riscv_sys_bpf+0x22/0x2a
 [<ffffffff80d2493a>] do_trap_ecall_u+0x102/0x17c
 [<ffffffff80d3048c>] ret_from_exception+0x0/0x64

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLz09O1+2gGVJuCxd_24a-7UueXzV-Ff+Fr+h5EKFDiYCQ@mail.gmail.com/
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240322153518.11555-1-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-22 20:44:09 -07:00
Puranjay Mohan
f7f5d1808b bpf: verifier: fix addr_space_cast from as(1) to as(0)
The verifier currently converts addr_space_cast from as(1) to as(0) that
is: BPF_ALU64 | BPF_MOV | BPF_X with off=1 and imm=1
to
BPF_ALU | BPF_MOV | BPF_X with imm=1 (32-bit mov)

Because of this imm=1, the JITs that have bpf_jit_needs_zext() == true,
interpret the converted instruction as BPF_ZEXT_REG(DST) which is a
special form of mov32, used for doing explicit zero extension on dst.
These JITs will just zero extend the dst reg and will not move the src to
dst before the zext.

Fix do_misc_fixups() to set imm=0 when converting addr_space_cast to a
normal mov32.

The JITs that have bpf_jit_needs_zext() == true rely on the verifier to
emit zext instructions. Mark dst_reg as subreg when doing cast from
as(1) to as(0) so the verifier emits a zext instruction after the mov.

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240321153939.113996-1-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-22 20:36:36 -07:00
Linus Torvalds
c150b809f7 RISC-V Patches for the 6.9 Merge Window
* Support for various vector-accelerated crypto routines.
 * Hibernation is now enabled for portable kernel builds.
 * mmap_rnd_bits_max is larger on systems with larger VAs.
 * Support for fast GUP.
 * Support for membarrier-based instruction cache synchronization.
 * Support for the Andes hart-level interrupt controller and PMU.
 * Some cleanups around unaligned access speed probing and Kconfig
   settings.
 * Support for ACPI LPI and CPPC.
 * Various cleanus related to barriers.
 * A handful of fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmX9icgTHHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYib+UD/4xyL6UMixx6A06BVBL9UT4vOrxRvNr
 JIihG5y5QNMjes9DHWL35mZTMqFtQ0tq94ViWFLmJWloV/8KRVM2C9R9KX7vplf3
 M/OwvP106spxgvNHoeQbycgs42RU1t2mpqT7N1iK2hCjqieP3vLn6hsSLXWTAG0L
 3gQbQw6XCLC3hPyLq+nbFY2i4faeCmpXWmixoy/IvQ5calZQrRU0LNlP6lcMBhVo
 uocjG0uGAhrahw2s81jxcMZcxa3AvUCiplapdD5H5v9rBM85SkYJj2Q9SqdSorkb
 xzuimRnKPI5s47yM3pTfZY0qnQUYHV7PXXuw4WujpCQVQdhaG+Ggq63UUZA61J9t
 IzZK2zdcfHqICrGTtXImUzRT3dcc3oq+IFq4tTY+rEJm29hrXkAtx+qBm5xtMvax
 fJz5feJ/iT0u7MDj4Oq24n+Kpl+Olm+MJaZX3m5Ovi/9V6a9iK9HXqxg9/Fs0fMO
 +J/0kTgd8Vu9CYH7KNWz3uztcO9eMAH3VyzuXuab4BGj1i1Y/9EjpALQi7rDN73S
 OsYQX6NnzMkBV4dvElJVLXiPlvNlMHZZwdak5CqPb48jaJu6iiIZAuvOrG6/naGP
 wnQSLVA2WWWoOkl3AJhxfpa11CLhbMl9E2gYm1VtNvASXoSFIxlAq1Yv3sG8yjty
 4ZT0rYFJOstYiQ==
 =3dL5
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - Support for various vector-accelerated crypto routines

 - Hibernation is now enabled for portable kernel builds

 - mmap_rnd_bits_max is larger on systems with larger VAs

 - Support for fast GUP

 - Support for membarrier-based instruction cache synchronization

 - Support for the Andes hart-level interrupt controller and PMU

 - Some cleanups around unaligned access speed probing and Kconfig
   settings

 - Support for ACPI LPI and CPPC

 - Various cleanus related to barriers

 - A handful of fixes

* tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits)
  riscv: Fix syscall wrapper for >word-size arguments
  crypto: riscv - add vector crypto accelerated AES-CBC-CTS
  crypto: riscv - parallelize AES-CBC decryption
  riscv: Only flush the mm icache when setting an exec pte
  riscv: Use kcalloc() instead of kzalloc()
  riscv/barrier: Add missing space after ','
  riscv/barrier: Consolidate fence definitions
  riscv/barrier: Define RISCV_FULL_BARRIER
  riscv/barrier: Define __{mb,rmb,wmb}
  RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ
  cpufreq: Move CPPC configs to common Kconfig and add RISC-V
  ACPI: RISC-V: Add CPPC driver
  ACPI: Enable ACPI_PROCESSOR for RISC-V
  ACPI: RISC-V: Add LPI driver
  cpuidle: RISC-V: Move few functions to arch/riscv
  riscv: Introduce set_compat_task() in asm/compat.h
  riscv: Introduce is_compat_thread() into compat.h
  riscv: add compile-time test into is_compat_task()
  riscv: Replace direct thread flag check with is_compat_task()
  riscv: Improve arch_get_mmap_end() macro
  ...
2024-03-22 10:41:13 -07:00
Eric Biggers
203a6763ab Revert "crypto: pkcs7 - remove sha1 support"
This reverts commit 16ab7cb582 because it
broke iwd.  iwd uses the KEYCTL_PKEY_* UAPIs via its dependency libell,
and apparently it is relying on SHA-1 signature support.  These UAPIs
are fairly obscure, and their documentation does not mention which
algorithms they support.  iwd really should be using a properly
supported userspace crypto library instead.  Regardless, since something
broke we have to revert the change.

It may be possible that some parts of this commit can be reinstated
without breaking iwd (e.g. probably the removal of MODULE_SIG_SHA1), but
for now this just does a full revert to get things working again.

Reported-by: Karel Balej <balejk@matfyz.cz>
Closes: https://lore.kernel.org/r/CZSHRUIJ4RKL.34T4EASV5DNJM@matfyz.cz
Cc: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Karel Balej <balejk@matfyz.cz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-03-22 19:42:20 +08:00
Valentin Schneider
4eab44a8ae context_tracking: Make context_tracking_key __ro_after_init
context_tracking_key is only ever enabled in __init ct_cpu_tracker_user(),
so mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-3-vschneid@redhat.com
2024-03-22 11:18:18 +01:00
Peter Zijlstra
91a1d97ef4 jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
When a static_key is marked ro_after_init, its state will never change
(after init), therefore jump_label_update() will never need to iterate
the entries, and thus module load won't actually need to track this --
avoiding the static_key::next write.

Therefore, mark these keys such that jump_label_add_module() might
recognise them and avoid the modification.

Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
to denote such keys.

jump_label_add_module() does not exist under CONFIG_JUMP_LABEL=n, so the
newly-introduced jump_label_init_ro() can be defined as a nop for that
configuration.

[ mingo: Renamed jump_label_ro() to jump_label_init_ro() ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-2-vschneid@redhat.com
2024-03-22 11:18:16 +01:00
Linus Torvalds
3faae16b5a RTC for 6.9
Subsytem:
  - rtc_class is now const
 
 Drivers:
  - ds1511: driver cleanup, set date and time range and alarm offset limit
  - max31335: fix interrupt handler
  - pcf8523: improve suspend support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEBqsFVZXh8s/0O5JiY6TcMGxwOjIFAmX8uCcACgkQY6TcMGxw
 OjI1hA//fK1Cs2eDL2E7e8bQCJtDJ+gdzVT210LwT9TCOgUV+arqLjF/i/pq8m5j
 CK/ukk5OLZL3v5dS7rsOZAiTg1N9Q+3SEoaTP029kRfA2SxOE2isR9W5nNAvyJC0
 L4Y4YEcEMsJpluK94G1/4ult1We9vCGSlqlNknChp3BL5ELxg7T/Ea5wda2OLSRx
 +ZgQpB6O+LIIQucm8mgCAlucMJrAj4+qEF7HO0AeElDh+md4OvjrF8Y30vTfs82L
 hGazWQgu8J4Jmy4u6q00s5IhfsvwNZVYgaASv8ivB8/9qgQLktZ6HFHRSwloRXg2
 o9OqcM/OR7Lw4taH/6pnVQ37UUAQ/Vy9QJy2GHza1uDzlBBjtqVz8zFqheu/rmnJ
 3AX7hlowBx9fPpxmStP2Ac9WIvtepV25eOlDDSlDiS8/4T/DRJHwpRQghSTDnFJ4
 fRU1HFbaj3ZKRgBzu5zJU4XvQO4X/DKxSHVMy+rVtsndWzDa8q11QZrQJAP4ZpLM
 /+zfar2RFnsnop1Dg9cz/hq0gWbQUMJ16qN3cFYl3jVVzC3JSld7B0+DhZdII1Lg
 YsCXuUQusnOTFm9GtyC6doHgmC3Dy2wentz6Vn6ynyBOR4NKjJPkI7PICdgzkIDi
 AJI9qSxQcT6gOSw2PAlzOcYb7AD3cOJCeg3ukt7aDl8S38RKR3w=
 =Is7n
 -----END PGP SIGNATURE-----

Merge tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "Subsytem:
   - rtc_class is now const

  Drivers:
   - ds1511: cleanup, set date and time range and alarm offset limit
   - max31335: fix interrupt handler
   - pcf8523: improve suspend support"

* tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (28 commits)
  MAINTAINER: Include linux-arm-msm for Qualcomm RTC patches
  dt-bindings: rtc: zynqmp: Add support for Versal/Versal NET SoCs
  rtc: class: make rtc_class constant
  dt-bindings: rtc: abx80x: Improve checks on trickle charger constraints
  MAINTAINERS: adjust file entry in ARM/Mediatek RTC DRIVER
  rtc: nct3018y: fix possible NULL dereference
  rtc: max31335: fix interrupt status reg
  rtc: mt6397: select IRQ_DOMAIN instead of depending on it
  dt-bindings: rtc: abx80x: convert to yaml
  rtc: m41t80: Use the unified property API get the wakeup-source property
  dt-bindings: at91rm9260-rtt: add sam9x7 compatible
  dt-bindings: rtc: convert MT7622 RTC to the json-schema
  dt-bindings: rtc: convert MT2717 RTC to the json-schema
  rtc: pcf8523: add suspend handlers for alarm IRQ
  rtc: ds1511: set alarm offset limit
  rtc: ds1511: set range
  rtc: ds1511: drop inline/noinline hints
  rtc: ds1511: rename pdata
  rtc: ds1511: implement ds1511_rtc_read_alarm properly
  rtc: ds1511: remove partial alarm support
  ...
2024-03-21 17:16:46 -07:00
Linus Torvalds
cba9ffdb99 Including fixes from CAN, netfilter, wireguard and IPsec.
Current release - regressions:
 
  - rxrpc: fix use of page_frag_alloc_align(), it changed semantics
    and we added a new caller in a different subtree
 
  - xfrm: allow UDP encapsulation only in offload modes
 
 Current release - new code bugs:
 
  - tcp: fix refcnt handling in __inet_hash_connect()
 
  - Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp
    packets", conflicted with some expectations in BPF uAPI
 
 Previous releases - regressions:
 
  - ipv4: raw: fix sending packets from raw sockets via IPsec tunnels
 
  - devlink: fix devlink's parallel command processing
 
  - veth: do not manipulate GRO when using XDP
 
  - esp: fix bad handling of pages from page_pool
 
 Previous releases - always broken:
 
  - report RCU QS for busy network kthreads (with Paul McK's blessing)
 
  - tcp/rds: fix use-after-free on netns with kernel TCP reqsk
 
  - virt: vmxnet3: fix missing reserved tailroom with XDP
 
 Misc:
 
  - couple of build fixes for Documentation
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmX8bXsACgkQMUZtbf5S
 IrsfBg/+KzrEx0tB/Af57ZZGZ5PMjPy+XFDox4iFfHm338UFuGXVvZrXd7G+6YkH
 ZwWeF5YDPKzwIEiZ5D3hewZPlkLH0Eg88q74chlE0gUv7t1jhuQHUdIVeFnPcLbN
 t/8AcCZCJ2fENbr1iNnzZON1RW0fVOl+SDxhiSYFeFqii6FywDfqWL/h0u86H/AF
 KRktgb0LzH0waH6IiefVV1NZyjnZwmQ6+UVQerTzUnQmWhV1xQKoO3MQpZuFRvr6
 O+kPZMkrqnTCCy7RO1BexS5cefqc80i5Z25FLGcaHgpnYd2pDNDMMxqrhqO9Y0Pv
 6u/tLgRxzVUDXWouzREIRe50Z9GJswkg78zilAhpqYiHRjd8jaBH6y+9mhGFc7F8
 iVAx02WfJhlk0aynFf2qZmR7PQIb9XjtFJ7OAeJrno9UD7zAubtikGM/6m6IZfRV
 TD1mze95RVnNjbHZMeg6oNLFUMJXVTobtvtqk5pTQvsNsmSYGFvkvWC5/P6ycyYt
 pMx6E0PA/ZCnQAlThCOCzFa5BO+It3RJHcQJhgbOzHrlWKwmrjBKcKJcLLcxFSUt
 4wwjdEcG1Bo2wdnsjwsQwJDHQW+M9TSLdLM3YVptM9jbqOMizoqr6/xSykg3H4wZ
 t/dSiYSsEr06z7lvwbAjUXJ/mfszZ+JsVAFXAN7ahcM4OZb5WTQ=
 =gpLl
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from CAN, netfilter, wireguard and IPsec.

  I'd like to highlight [ lowlight? - Linus ] Florian W stepping down as
  a netfilter maintainer due to constant stream of bug reports. Not sure
  what we can do but IIUC this is not the first such case.

  Current release - regressions:

   - rxrpc: fix use of page_frag_alloc_align(), it changed semantics and
     we added a new caller in a different subtree

   - xfrm: allow UDP encapsulation only in offload modes

  Current release - new code bugs:

   - tcp: fix refcnt handling in __inet_hash_connect()

   - Revert "net: Re-use and set mono_delivery_time bit for userspace
     tstamp packets", conflicted with some expectations in BPF uAPI

  Previous releases - regressions:

   - ipv4: raw: fix sending packets from raw sockets via IPsec tunnels

   - devlink: fix devlink's parallel command processing

   - veth: do not manipulate GRO when using XDP

   - esp: fix bad handling of pages from page_pool

  Previous releases - always broken:

   - report RCU QS for busy network kthreads (with Paul McK's blessing)

   - tcp/rds: fix use-after-free on netns with kernel TCP reqsk

   - virt: vmxnet3: fix missing reserved tailroom with XDP

  Misc:

   - couple of build fixes for Documentation"

* tag 'net-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
  selftests: forwarding: Fix ping failure due to short timeout
  MAINTAINERS: step down as netfilter maintainer
  netfilter: nf_tables: Fix a memory leak in nf_tables_updchain
  net: dsa: mt7530: fix handling of all link-local frames
  net: dsa: mt7530: fix link-local frames that ingress vlan filtering ports
  bpf: report RCU QS in cpumap kthread
  net: report RCU QS on threaded NAPI repolling
  rcu: add a helper to report consolidated flavor QS
  ionic: update documentation for XDP support
  lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel doc
  netfilter: nf_tables: do not compare internal table flags on updates
  netfilter: nft_set_pipapo: release elements in clone only from destroy path
  octeontx2-af: Use separate handlers for interrupts
  octeontx2-pf: Send UP messages to VF only when VF is up.
  octeontx2-pf: Use default max_active works instead of one
  octeontx2-pf: Wait till detach_resources msg is complete
  octeontx2: Detect the mbox up or down message via register
  devlink: fix port new reply cmd type
  tcp: Clear req->syncookie in reqsk_alloc().
  net/bnx2x: Prevent access to a freed page in page_pool
  ...
2024-03-21 14:50:39 -07:00
Linus Torvalds
1d35aae78f Kbuild updates for v6.9
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
 
  - Use more threads when building Debian packages in parallel
 
  - Fix warnings shown during the RPM kernel package uninstallation
 
  - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
    Makefile
 
  - Support GCC's -fmin-function-alignment flag
 
  - Fix a null pointer dereference bug in modpost
 
  - Add the DTB support to the RPM package
 
  - Various fixes and cleanups in Kconfig
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM
 VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC
 7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz
 vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG
 ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP
 OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff
 LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p
 +XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ
 FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ
 c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0
 IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V
 dWH7BPecS44h4KXx
 =tFdl
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)

 - Use more threads when building Debian packages in parallel

 - Fix warnings shown during the RPM kernel package uninstallation

 - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
   Makefile

 - Support GCC's -fmin-function-alignment flag

 - Fix a null pointer dereference bug in modpost

 - Add the DTB support to the RPM package

 - Various fixes and cleanups in Kconfig

* tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits)
  kconfig: tests: test dependency after shuffling choices
  kconfig: tests: add a test for randconfig with dependent choices
  kconfig: tests: support KCONFIG_SEED for the randconfig runner
  kbuild: rpm-pkg: add dtb files in kernel rpm
  kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig()
  kconfig: check prompt for choice while parsing
  kconfig: lxdialog: remove unused dialog colors
  kconfig: lxdialog: fix button color for blackbg theme
  modpost: fix null pointer dereference
  kbuild: remove GCC's default -Wpacked-bitfield-compat flag
  kbuild: unexport abs_srctree and abs_objtree
  kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1
  kconfig: remove named choice support
  kconfig: use linked list in get_symbol_str() to iterate over menus
  kconfig: link menus to a symbol
  kbuild: fix inconsistent indentation in top Makefile
  kbuild: Use -fmin-function-alignment when available
  alpha: merge two entries for CONFIG_ALPHA_GAMMA
  alpha: merge two entries for CONFIG_ALPHA_EV4
  kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj)
  ...
2024-03-21 14:41:00 -07:00
Linus Torvalds
241590e5a1 Driver core changes for 6.9-rc1
Here is the "big" set of driver core and kernfs changes for 6.9-rc1.
 
 Nothing all that crazy here, just some good updates that include:
   - automatic attribute group hiding from Dan Williams (he fixed up my
     horrible attempt at doing this.)
   - kobject lock contention fixes from Eric Dumazet
   - driver core cleanups from Andy
   - kernfs rcu work from Tejun
   - fw_devlink changes to resolve some reported issues
   - other minor changes, all details in the shortlog
 
 All of these have been in linux-next for a long time with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwsHg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynT4ACePcNRAsYrINlOPPKPHimJtyP01yEAn0pZYnj2
 0/UpqIqf3HVPu7zsLKTa
 =vR9S
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core and kernfs changes for 6.9-rc1.

  Nothing all that crazy here, just some good updates that include:

   - automatic attribute group hiding from Dan Williams (he fixed up my
     horrible attempt at doing this.)

   - kobject lock contention fixes from Eric Dumazet

   - driver core cleanups from Andy

   - kernfs rcu work from Tejun

   - fw_devlink changes to resolve some reported issues

   - other minor changes, all details in the shortlog

  All of these have been in linux-next for a long time with no reported
  issues"

* tag 'driver-core-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits)
  device: core: Log warning for devices pending deferred probe on timeout
  driver: core: Use dev_* instead of pr_* so device metadata is added
  driver: core: Log probe failure as error and with device metadata
  of: property: fw_devlink: Add support for "post-init-providers" property
  driver core: Add FWLINK_FLAG_IGNORE to completely ignore a fwnode link
  driver core: Adds flags param to fwnode_link_add()
  debugfs: fix wait/cancellation handling during remove
  device property: Don't use "proxy" headers
  device property: Move enum dev_dma_attr to fwnode.h
  driver core: Move fw_devlink stuff to where it belongs
  driver core: Drop unneeded 'extern' keyword in fwnode.h
  firmware_loader: Suppress warning on FW_OPT_NO_WARN flag
  sysfs:Addresses documentation in sysfs_merge_group and sysfs_unmerge_group.
  firmware_loader: introduce __free() cleanup hanler
  platform-msi: Remove usage of the deprecated ida_simple_xx() API
  sysfs: Introduce DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE()
  sysfs: Document new "group visible" helpers
  sysfs: Fix crash on empty group attributes array
  sysfs: Introduce a mechanism to hide static attribute_groups
  sysfs: Introduce a mechanism to hide static attribute_groups
  ...
2024-03-21 13:34:15 -07:00
Waiman Long
3774b28d8f locking/qspinlock: Always evaluate lockevent* non-event parameter once
The 'inc' parameter of lockevent_add() and the cond parameter of
lockevent_cond_inc() are only evaluated when CONFIG_LOCK_EVENT_COUNTS
is on. That can cause problem if those parameters are expressions
with side effect like a "++". Fix this by evaluating those non-event
parameters once even if CONFIG_LOCK_EVENT_COUNTS is off. This will also
eliminate the need of the __maybe_unused attribute to the wait_early
local variable in pv_wait_node().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20240319005004.1692705-1-longman@redhat.com
2024-03-21 20:45:17 +01:00
Linus Torvalds
3bcb0bf65c TTY/Serial driver update for 6.9-rc1
Here is the big set of TTY/Serial driver updates and cleanups for
 6.9-rc1.  Included in here are:
   - more tty cleanups from Jiri
   - loads of 8250 driver cleanups from Andy
   - max310x driver updates
   - samsung serial driver updates
   - uart_prepare_sysrq_char() updates for many drivers
   - platform driver remove callback void cleanups
   - stm32 driver updates
   - other small tty/serial driver updates
 
 All of these have been in linux-next for a long time with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZfwqow8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynNegCffxTbsnbMGjWhVrQ326IJx/DFvNMAoI9csigv
 m+G3RzefzZLRx8nAma0c
 =GMfc
 -----END PGP SIGNATURE-----

Merge tag 'tty-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty / serial driver updates from Greg KH:
 "Here is the big set of TTY/Serial driver updates and cleanups for
  6.9-rc1. Included in here are:

   - more tty cleanups from Jiri

   - loads of 8250 driver cleanups from Andy

   - max310x driver updates

   - samsung serial driver updates

   - uart_prepare_sysrq_char() updates for many drivers

   - platform driver remove callback void cleanups

   - stm32 driver updates

   - other small tty/serial driver updates

  All of these have been in linux-next for a long time with no reported
  issues"

* tag 'tty-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits)
  dt-bindings: serial: stm32: add power-domains property
  serial: 8250_dw: Replace ACPI device check by a quirk
  serial: Lock console when calling into driver before registration
  serial: 8250_uniphier: Switch to use uart_read_port_properties()
  serial: 8250_tegra: Switch to use uart_read_port_properties()
  serial: 8250_pxa: Switch to use uart_read_port_properties()
  serial: 8250_omap: Switch to use uart_read_port_properties()
  serial: 8250_of: Switch to use uart_read_port_properties()
  serial: 8250_lpc18xx: Switch to use uart_read_port_properties()
  serial: 8250_ingenic: Switch to use uart_read_port_properties()
  serial: 8250_dw: Switch to use uart_read_port_properties()
  serial: 8250_bcm7271: Switch to use uart_read_port_properties()
  serial: 8250_bcm2835aux: Switch to use uart_read_port_properties()
  serial: 8250_aspeed_vuart: Switch to use uart_read_port_properties()
  serial: port: Introduce a common helper to read properties
  serial: core: Add UPIO_UNKNOWN constant for unknown port type
  serial: core: Move struct uart_port::quirks closer to possible values
  serial: sh-sci: Call sci_serial_{in,out}() directly
  serial: core: only stop transmit when HW fifo is empty
  serial: pch: Use uart_prepare_sysrq_char().
  ...
2024-03-21 12:44:10 -07:00
Harishankar Vishwanathan
4c2a26fc80 bpf-next: Avoid goto in regs_refine_cond_op()
In case of GE/GT/SGE/JST instructions, regs_refine_cond_op()
reuses the logic that does analysis of LE/LT/SLE/SLT instructions.
This commit avoids the use of a goto to perform the reuse.

Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240321002955.808604-1-harishankar.vishwanathan@gmail.com
2024-03-21 11:56:26 -07:00
Yan Zhai
00bf631224 bpf: report RCU QS in cpumap kthread
When there are heavy load, cpumap kernel threads can be busy polling
packets from redirect queues and block out RCU tasks from reaching
quiescent states. It is insufficient to just call cond_resched() in such
context. Periodically raise a consolidated RCU QS before cond_resched
fixes the problem.

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-20 21:05:43 -07:00
Andrii Nakryiko
68ca5d4eeb bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs
Wire up BPF cookie for raw tracepoint programs (both BTF and non-BTF
aware variants). This brings them up to part w.r.t. BPF cookie usage
with classic tracepoint and fentry/fexit programs.

Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319233852.1977493-4-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 23:05:34 -07:00
Andrii Nakryiko
d4dfc5700e bpf: pass whole link instead of prog when triggering raw tracepoint
Instead of passing prog as an argument to bpf_trace_runX() helpers, that
are called from tracepoint triggering calls, store BPF link itself
(struct bpf_raw_tp_link for raw tracepoints). This will allow to pass
extra information like BPF cookie into raw tracepoint registration.

Instead of replacing `struct bpf_prog *prog = __data;` with
corresponding `struct bpf_raw_tp_link *link = __data;` assignment in
`__bpf_trace_##call` I just passed `__data` through into underlying
bpf_trace_runX() call. This works well because we implicitly cast `void *`,
and it also avoids naming clashes with arguments coming from
tracepoint's "proto" list. We could have run into the same problem with
"prog", we just happened to not have a tracepoint that has "prog" input
argument. We are less lucky with "link", as there are tracepoints using
"link" argument name already. So instead of trying to avoid naming
conflicts, let's just remove intermediate local variable. It doesn't
hurt readibility, it's either way a bit of a maze of calls and macros,
that requires careful reading.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319233852.1977493-3-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 23:05:33 -07:00
Andrii Nakryiko
6b9c2950c9 bpf: flatten bpf_probe_register call chain
bpf_probe_register() and __bpf_probe_register() have identical
signatures and bpf_probe_register() just redirect to
__bpf_probe_register(). So get rid of this extra function call step to
simplify following the source code.

It has no difference at runtime due to inlining, of course.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Message-ID: <20240319233852.1977493-2-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-19 23:05:33 -07:00
Yonghong Song
eb166e522c bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.

We have an internal use case where for an application running
in a container (with pid namespace), user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Currently, cgroup bpf progs already allow
bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
as well.

With auditing the code, bpf_get_current_pid_tgid() is also used
by sk_msg prog. But there are no side effect to expose these two
helpers to all prog types since they do not reveal any kernel specific
data. The detailed discussion is in [1].

So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
are put in bpf_base_func_proto(), making them available to all
program types.

  [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev
2024-03-19 14:24:07 -07:00
Linus Torvalds
fbd88dd057 More power management updates for 6.9-rc1
- Modify the Energy Model code to bail out and complain if the unit of
    power is not uW to prevent errors due to unit mismatches (Lukasz Luba).
 
  - Make the intel_rapl platform driver use a remove callback returning
    void (Uwe Kleine-König).
 
  - Fix typo in the suspend and interrupts document (Saravana Kannan).
 
  - Make per-policy boost flags actually take effect on platforms using
    cpufreq_boost_set_sw() (Sibi Sankar).
 
  - Enable boost support in the SCMI cpufreq driver (Sibi Sankar).
 
  - Make the DT cpufreq driver use zalloc_cpumask_var() for allocating
    cpumasks to avoid using unitinialized memory (Marek Szyprowski).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmX5iRASHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxof8P/R+MrENvw2bIe5VGwqX99Nd47spbAe2J
 QchuY8R9RIhd4laJ74C+Nc7E1rawLukLHctVql1QzkygldyriAjhHnVeL98Tmh8C
 H+2928BEztPzbIzLMQ7uAVKY+iUD9BfTipPwqV6D118nswtRUbubQkLtrKAmssHQ
 Sm5BqYmEb91CiSIe9wjqCyfc7kq7ina7nqmGa7DgZNMUgCzjmAEDRz1vXV1oRjNl
 aIXpx9X1wtG699HRCXBkxt7SbgXN3bFim6ZFoTvxT4U2Rhcj9JJ/PxKufFL2vuGF
 1M99eu1aMqL0iS3vhj4k88uPG5kDv37qIFYD6Z2vqRVTy040fL+6Sny5IKMlQb+Z
 PUB/TT1rulhH4kCjhkJXuErLqMBG2Ujf8D4drosQK2avp20pF0TCm2h+0LX3HVXM
 dcv+WirjOyljygWm4D0XyvY6U88gcKiGGY/Qf7YQ/GC8Jyrj0h6qMvDw38725XU+
 VQ0JlYCoM1R24dT4iYWWrMbvK6pEQ5R9lwJNSgdtQWOFNmyQy8j8s7D4qyRrCdG/
 JERwNIi+NF3oezg3ahF8s9cemnkndqxlqX1atMmjL9WCbqUPl2LCxJ7TWM5H/mfg
 /h3jNk9ai6d2cHStvoxFzbAYqZX/72IVEteOomKKiHdW7lxWUW+J2xZRy0Pv0163
 iNNyshJAsuaf
 =dlQ/
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These update the Energy Model to make it prevent errors due to power
  unit mismatches, fix a typo in power management documentation, convert
  one driver to using a platform remove callback returning void, address
  two cpufreq issues (one in the core and one in the DT driver), and
  enable boost support in the SCMI cpufreq driver.

  Specifics:

   - Modify the Energy Model code to bail out and complain if the unit
     of power is not uW to prevent errors due to unit mismatches (Lukasz
     Luba)

   - Make the intel_rapl platform driver use a remove callback returning
     void (Uwe Kleine-König)

   - Fix typo in the suspend and interrupts document (Saravana Kannan)

   - Make per-policy boost flags actually take effect on platforms using
     cpufreq_boost_set_sw() (Sibi Sankar)

   - Enable boost support in the SCMI cpufreq driver (Sibi Sankar)

   - Make the DT cpufreq driver use zalloc_cpumask_var() for allocating
     cpumasks to avoid using unitinialized memory (Marek Szyprowski)"

* tag 'pm-6.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: scmi: Enable boost support
  firmware: arm_scmi: Add support for marking certain frequencies as turbo
  cpufreq: dt: always allocate zeroed cpumask
  cpufreq: Fix per-policy boost behavior on SoCs using cpufreq_boost_set_sw()
  Documentation: power: Fix typo in suspend and interrupts doc
  PM: EM: Force device drivers to provide power in uW
  powercap: intel_rapl: Convert to platform remove callback returning void
2024-03-19 11:19:36 -07:00
Jesper Dangaard Brouer
1a4a0cb798 bpf/lpm_trie: Inline longest_prefix_match for fastpath
The BPF map type LPM (Longest Prefix Match) is used heavily
in production by multiple products that have BPF components.
Perf data shows trie_lookup_elem() and longest_prefix_match()
being part of kernels perf top.

For every level in the LPM tree trie_lookup_elem() calls out
to longest_prefix_match().  The compiler is free to inline this
call, but chooses not to inline, because other slowpath callers
(that can be invoked via syscall) exists like trie_update_elem(),
trie_delete_elem() or trie_get_next_key().

 bcc/tools/funccount -Ti 1 'trie_lookup_elem|longest_prefix_match.isra.0'
 FUNC                                    COUNT
 trie_lookup_elem                       664945
 longest_prefix_match.isra.0           8101507

Observation on a single random machine shows a factor 12 between
the two functions. Given an average of 12 levels in the trie being
searched.

This patch force inlining longest_prefix_match(), but only for
the lookup fastpath to balance object instruction size.

In production with AMD CPUs, measuring the function latency of
'trie_lookup_elem' (bcc/tools/funclatency) we are seeing an improvement
function latency reduction 7-8% with this patch applied (to production
kernels 6.6 and 6.1). Analyzing perf data, we can explain this rather
large improvement due to reducing the overhead for AMD side-channel
mitigation SRSO (Speculative Return Stack Overflow).

Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/171076828575.2141737.18370644069389889027.stgit@firesoul
2024-03-19 13:46:03 +01:00
Frederic Weisbecker
0387703986 timers: Fix removed self-IPI on global timer's enqueue in nohz_full
While running in nohz_full mode, a task may enqueue a timer while the
tick is stopped. However the only places where the timer wheel,
alongside the timer migration machinery's decision, may reprogram the
next event accordingly with that new timer's expiry are the idle loop or
any IRQ tail.

However neither the idle task nor an interrupt may run on the CPU if it
resumes busy work in userspace for a long while in full dynticks mode.

To solve this, the timer enqueue path raises a self-IPI that will
re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
avoids potential locking inversion.

This is supposed to happen both for local and global timers but commit:

	b2cf7507e1 ("timers: Always queue timers on the local CPU")

broke the global timers case with removing the ->is_idle field handling
for the global base. As a result, global timers enqueue may go unnoticed
in nohz_full.

Fix this with restoring the idle tracking of the global timer's base,
allowing self-IPIs again on enqueue time.

Fixes: b2cf7507e1 ("timers: Always queue timers on the local CPU")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240318230729.15497-3-frederic@kernel.org
2024-03-19 10:14:55 +01:00
Frederic Weisbecker
f55acb1e44 timers/migration: Fix endless timer requeue after idle interrupts
When a CPU is an idle migrator, but another CPU wakes up before it,
becomes an active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever such as in the following scenario:

               [GRP0:0]
             migrator = 0
             active   = 0
             nextevt  = T1
              /         \
             0           1
          active        idle (T1)

0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.

               [GRP0:0]
             migrator = NONE
             active   = NONE
             nextevt  = T1
              /         \
             0           1
          idle        idle (T1)
          wakeup = T1

1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.

                [GRP0:0]
             migrator = 1
             active   = 1
             nextevt  = KTIME_MAX
              /         \
             0           1
          idle        active
          wakeup = T1

2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.

3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realize it is not the migrator anymore. So it early returns without
observing that T1 has been expired already and therefore without
updating its ->wakeup value.

4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer is programmed to fire now.

5) goto 3) forever

This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).

Fix this with forcing a re-evaluation of tmc->wakeup while trying
remote timer handling when the CPU isn't the migrator anymmore. The
check is inherently racy but in the worst case the CPU just races setting
the KTIME_MAX value that a remote expiry also tries to set.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240318230729.15497-2-frederic@kernel.org
2024-03-19 10:14:55 +01:00
Linus Torvalds
ad584d73a2 Tracing updates for 6.9:
Main user visible change:
 
 - User events can now have "multi formats"
 
   The current user events have a single format. If another event is created
   with a different format, it will fail to be created. That is, once an
   event name is used, it cannot be used again with a different format. This
   can cause issues if a library is using an event and updates its format.
   An application using the older format will prevent an application using
   the new library from registering its event.
 
   A task could also DOS another application if it knows the event names, and
   it creates events with different formats.
 
   The multi-format event is in a different name space from the single
   format. Both the event name and its format are the unique identifier.
   This will allow two different applications to use the same user event name
   but with different payloads.
 
 - Added support to have ftrace_dump_on_oops dump out instances and
   not just the main top level tracing buffer.
 
 Other changes:
 
 - Add eventfs_root_inode
 
   Only the root inode has a dentry that is static (never goes away) and
   stores it upon creation. There's no reason that the thousands of other
   eventfs inodes should have a pointer that never gets set in its
   descriptor. Create a eventfs_root_inode desciptor that has a eventfs_inode
   descriptor and a dentry pointer, and only the root inode will use this.
 
 - Added WARN_ON()s in eventfs
 
   There's some conditionals remaining in eventfs that should never be hit,
   but instead of removing them, add WARN_ON() around them to make sure that
   they are never hit.
 
 - Have saved_cmdlines allocation also include the map_cmdline_to_pid array
 
   The saved_cmdlines structure allocates a large amount of data to hold its
   mappings. Within it, it has three arrays. Two are already apart of it:
   map_pid_to_cmdline[] and saved_cmdlines[]. More memory can be saved by
   also including the map_cmdline_to_pid[] array as well.
 
 - Restructure __string() and __assign_str() macros used in TRACE_EVENT().
 
   Dynamic strings in TRACE_EVENT() are declared with:
 
       __string(name, source)
 
   And assigned with:
 
      __assign_str(name, source)
 
   In the tracepoint callback of the event, the __string() is used to get the
   size needed to allocate on the ring buffer and __assign_str() is used to
   copy the string into the ring buffer. There's a helper structure that is
   created in the TRACE_EVENT() macro logic that will hold the string length
   and its position in the ring buffer which is created by __string().
 
   There are several trace events that have a function to create the string
   to save. This function is executed twice. Once for __string() and again
   for __assign_str(). There's no reason for this. The helper structure could
   also save the string it used in __string() and simply copy that into
   __assign_str() (it also already has its length).
 
   By using the structure to store the source string for the assignment, it
   means that the second argument to __assign_str() is no longer needed.
 
   It will be removed in the next merge window, but for now add a warning if
   the source string given to __string() is different than the source string
   given to __assign_str(), as the source to __assign_str() isn't even used
   and will be going away.
 
 - Added checks to make sure that the source of __string() is also the
   source of __assign_str() so that it can be safely removed in the next
   merge window.
 
   Included fixes that the above check found.
 
 - Other minor clean ups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZfhbUBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrhJAP9bfnYO7tfNGZVNPmTT7Fz0z4zCU1Pb
 P8M+24yiFTeFWwD/aIPlMFZONVkTdFAlLdffl6kJOKxZ7vW4XzUjfNWb6wo=
 =z/D6
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:
 "Main user visible change:

   - User events can now have "multi formats"

     The current user events have a single format. If another event is
     created with a different format, it will fail to be created. That
     is, once an event name is used, it cannot be used again with a
     different format. This can cause issues if a library is using an
     event and updates its format. An application using the older format
     will prevent an application using the new library from registering
     its event.

     A task could also DOS another application if it knows the event
     names, and it creates events with different formats.

     The multi-format event is in a different name space from the single
     format. Both the event name and its format are the unique
     identifier. This will allow two different applications to use the
     same user event name but with different payloads.

   - Added support to have ftrace_dump_on_oops dump out instances and
     not just the main top level tracing buffer.

  Other changes:

   - Add eventfs_root_inode

     Only the root inode has a dentry that is static (never goes away)
     and stores it upon creation. There's no reason that the thousands
     of other eventfs inodes should have a pointer that never gets set
     in its descriptor. Create a eventfs_root_inode desciptor that has a
     eventfs_inode descriptor and a dentry pointer, and only the root
     inode will use this.

   - Added WARN_ON()s in eventfs

     There's some conditionals remaining in eventfs that should never be
     hit, but instead of removing them, add WARN_ON() around them to
     make sure that they are never hit.

   - Have saved_cmdlines allocation also include the map_cmdline_to_pid
     array

     The saved_cmdlines structure allocates a large amount of data to
     hold its mappings. Within it, it has three arrays. Two are already
     apart of it: map_pid_to_cmdline[] and saved_cmdlines[]. More memory
     can be saved by also including the map_cmdline_to_pid[] array as
     well.

   - Restructure __string() and __assign_str() macros used in
     TRACE_EVENT()

     Dynamic strings in TRACE_EVENT() are declared with:

         __string(name, source)

     And assigned with:

        __assign_str(name, source)

     In the tracepoint callback of the event, the __string() is used to
     get the size needed to allocate on the ring buffer and
     __assign_str() is used to copy the string into the ring buffer.
     There's a helper structure that is created in the TRACE_EVENT()
     macro logic that will hold the string length and its position in
     the ring buffer which is created by __string().

     There are several trace events that have a function to create the
     string to save. This function is executed twice. Once for
     __string() and again for __assign_str(). There's no reason for
     this. The helper structure could also save the string it used in
     __string() and simply copy that into __assign_str() (it also
     already has its length).

     By using the structure to store the source string for the
     assignment, it means that the second argument to __assign_str() is
     no longer needed.

     It will be removed in the next merge window, but for now add a
     warning if the source string given to __string() is different than
     the source string given to __assign_str(), as the source to
     __assign_str() isn't even used and will be going away.

   - Added checks to make sure that the source of __string() is also the
     source of __assign_str() so that it can be safely removed in the
     next merge window.

     Included fixes that the above check found.

   - Other minor clean ups and fixes"

* tag 'trace-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
  tracing: Add __string_src() helper to help compilers not to get confused
  tracing: Use strcmp() in __assign_str() WARN_ON() check
  tracepoints: Use WARN() and not WARN_ON() for warnings
  tracing: Use div64_u64() instead of do_div()
  tracing: Support to dump instance traces by ftrace_dump_on_oops
  tracing: Remove second parameter to __assign_rel_str()
  tracing: Add warning if string in __assign_str() does not match __string()
  tracing: Add __string_len() example
  tracing: Remove __assign_str_len()
  ftrace: Fix most kernel-doc warnings
  tracing: Decrement the snapshot if the snapshot trigger fails to register
  tracing: Fix snapshot counter going between two tracers that use it
  tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"
  tracing: Use ? : shortcut in trace macros
  tracing: Do not calculate strlen() twice for __string() fields
  tracing: Rework __assign_str() and __string() to not duplicate getting the string
  cxl/trace: Properly initialize cxl_poison region name
  net: hns3: tracing: fix hclgevf trace event strings
  drm/i915: Add missing ; to __assign_str() macros in tracepoint code
  NFSD: Fix nfsd_clid_class use of __string_len() macro
  ...
2024-03-18 15:11:44 -07:00
Christophe Leroy
c733239f8f bpf: Check return from set_memory_rox()
arch_protect_bpf_trampoline() and alloc_new_pack() call
set_memory_rox() which can fail, leading to unprotected memory.

Take into account return from set_memory_rox() function and add
__must_check flag to arch_protect_bpf_trampoline().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-18 14:18:47 -07:00
Christophe Leroy
e3362acd79 bpf: Remove arch_unprotect_bpf_trampoline()
Last user of arch_unprotect_bpf_trampoline() was removed by
commit 187e2af05a ("bpf: struct_ops supports more than one page for
trampolines.")

Remove arch_unprotect_bpf_trampoline()

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 187e2af05a ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/42c635bb54d3af91db0f9b85d724c7c290069f67.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-18 14:18:47 -07:00
Martin KaFai Lau
7f3edd0c72 bpf: Remove unnecessary err < 0 check in bpf_struct_ops_map_update_elem
There is a "if (err)" check earlier, so the "if (err < 0)"
check that this patch removing is unnecessary. It was my overlook
when making adjustments to the bpf_struct_ops_prepare_trampoline()
such that the caller does not have to worry about the new page when
the function returns error.

Fixes: 187e2af05a ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20240315192112.2825039-1-martin.lau@linux.dev
2024-03-18 12:48:10 -07:00
Thorsten Blum
d6cb38e108 tracing: Use div64_u64() instead of do_div()
Fixes Coccinelle/coccicheck warnings reported by do_div.cocci.

Compared to do_div(), div64_u64() does not implicitly cast the divisor and
does not unnecessarily calculate the remainder.

Link: https://lore.kernel.org/linux-trace-kernel/20240225164507.232942-2-thorsten.blum@toblux.com

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:06 -04:00
Huang Yiwei
19f0423fd5 tracing: Support to dump instance traces by ftrace_dump_on_oops
Currently ftrace only dumps the global trace buffer on an OOPs. For
debugging a production usecase, instance trace will be helpful to
check specific problems since global trace buffer may be used for
other purposes.

This patch extend the ftrace_dump_on_oops parameter to dump a specific
or multiple trace instances:

  - ftrace_dump_on_oops=0: as before -- don't dump
  - ftrace_dump_on_oops[=1]: as before -- dump the global trace buffer
  on all CPUs
  - ftrace_dump_on_oops=2 or =orig_cpu: as before -- dump the global
  trace buffer on CPU that triggered the oops
  - ftrace_dump_on_oops=<instance_name>: new behavior -- dump the
  tracing instance matching <instance_name>
  - ftrace_dump_on_oops[=2/orig_cpu],<instance1_name>[=2/orig_cpu],
  <instrance2_name>[=2/orig_cpu]: new behavior -- dump the global trace
  buffer and multiple instance buffer on all CPUs, or only dump on CPU
  that triggered the oops if =2 or =orig_cpu is given

Also, the sysctl node can handle the input accordingly.

Link: https://lore.kernel.org/linux-trace-kernel/20240223083126.1817731-1-quic_hyiwei@quicinc.com

Cc: Ross Zwisler <zwisler@google.com>
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mcgrof@kernel.org>
Cc: <keescook@chromium.org>
Cc: <j.granados@samsung.com>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <corbet@lwn.net>
Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:06 -04:00
Randy Dunlap
d15304135c ftrace: Fix most kernel-doc warnings
Reduce the number of kernel-doc warnings from 52 down to 10, i.e.,
fix 42 kernel-doc warnings by (a) using the Returns: format for
function return values or (b) using "@var:" instead of "@var -"
for function parameter descriptions.

Fix one return values list so that it is formatted correctly when
rendered for output.

Spell "non-zero" with a hyphen in several places.

Link: https://lore.kernel.org/linux-trace-kernel/20240223054833.15471-1-rdunlap@infradead.org

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/oe-kbuild-all/202312180518.X6fRyDSN-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:05 -04:00
Steven Rostedt (Google)
2048fdc275 tracing: Decrement the snapshot if the snapshot trigger fails to register
Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a snapshot trigger was added, even if that snapshot trigger
failed.

 # cd /sys/kernel/tracing
 # echo "snapshot" > events/sched/sched_process_fork/trigger
 # echo "snapshot" > events/sched/sched_process_fork/trigger
 -bash: echo: write error: File exists

That second one that fails increments the snapshot counter but doesn't
decrement it. It needs to be decremented when the snapshot fails.

Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.729055907@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:05 -04:00
Steven Rostedt (Google)
cca990c7b5 tracing: Fix snapshot counter going between two tracers that use it
Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a tracer that uses the snapshot is enabled even if the snapshot
was used by the previous tracer.

That is:

 # cd /sys/kernel/tracing
 # echo wakeup_rt > current_tracer
 # echo wakeup_dl > current_tracer
 # echo nop > current_tracer

would leave the snapshot counter at 1 and not zero. That's because the
enabling of wakeup_dl would increment the counter again but the setting
the tracer to nop would only decrement it once.

Do not arm the snapshot for a tracer if the previous tracer already had it
armed.

Link: https://lore.kernel.org/linux-trace-kernel/20240223013344.570525723@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:33:05 -04:00
John Garry
ed89683763 tracing: Use init_utsname()->release
Instead of using UTS_RELEASE, use init_utsname()->release, which means that
we don't need to rebuild the code just for the git head commit changing.

Link: https://lore.kernel.org/linux-trace-kernel/20240222124639.65629-1-john.g.garry@oracle.com

Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:13:21 -04:00
Beau Belgrave
64805e4039 tracing/user_events: Introduce multi-format events
Currently user_events supports 1 event with the same name and must have
the exact same format when referenced by multiple programs. This opens
an opportunity for malicious or poorly thought through programs to
create events that others use with different formats. Another scenario
is user programs wishing to use the same event name but add more fields
later when the software updates. Various versions of a program may be
running side-by-side, which is prevented by the current single format
requirement.

Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates
the user program wishes to use the same user_event name, but may have
several different formats of the event. When this flag is used, create
the underlying tracepoint backing the user_event with a unique name
per-version of the format. It's important that existing ABI users do
not get this logic automatically, even if one of the multi format
events matches the format. This ensures existing programs that create
events and assume the tracepoint name will match exactly continue to
work as expected. Add logic to only check multi-format events with
other multi-format events and single-format events to only check
single-format events during find.

Change system name of the multi-format event tracepoint to ensure that
multi-format events are isolated completely from single-format events.
This prevents single-format names from conflicting with multi-format
events if they end with the same suffix as the multi-format events.

Add a register_name (reg_name) to the user_event struct which allows for
split naming of events. We now have the name that was used to register
within user_events as well as the unique name for the tracepoint. Upon
registering events ensure matches based on first the reg_name, followed
by the fields and format of the event. This allows for multiple events
with the same registered name to have different formats. The underlying
tracepoint will have a unique name in the format of {reg_name}.{unique_id}.

For example, if both "test u32 value" and "test u64 value" are used with
the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique
tracepoints. The dynamic_events file would then show the following:
  u:test u64 count
  u:test u32 count

The actual tracepoint names look like this:
  test.0
  test.1

Both would be under the new user_events_multi system name to prevent the
older ABI from being used to squat on multi-formatted events and block
their use.

Deleting events via "!u:test u64 count" would only delete the first
tracepoint that matched that format. When the delete ABI is used all
events with the same name will be attempted to be deleted. If
per-version deletion is required, user programs should either not use
persistent events or delete them via dynamic_events.

Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-3-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:13:03 -04:00
Beau Belgrave
1e953de9e9 tracing/user_events: Prepare find/delete for same name events
The current code for finding and deleting events assumes that there will
never be cases when user_events are registered with the same name, but
different formats. Scenarios exist where programs want to use the same
name but have different formats. An example is multiple versions of a
program running side-by-side using the same event name, but with updated
formats in each version.

This change does not yet allow for multi-format events. If user_events
are registered with the same name but different arguments the programs
see the same return values as before. This change simply makes it
possible to easily accommodate for this.

Update find_user_event() to take in argument parameters and register
flags to accommodate future multi-format event scenarios. Have find
validate argument matching and return error pointers to cover when
an existing event has the same name but different format. Update
callers to handle error pointer logic.

Move delete_user_event() to use hash walking directly now that
find_user_event() has changed. Delete all events found that match the
register name, stop if an error occurs and report back to the user.

Update user_fields_match() to cover list_empty() scenarios now that
find_user_event() uses it directly. This makes the logic consistent
across several callsites.

Link: https://lore.kernel.org/linux-trace-kernel/20240222001807.1463-2-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:12:54 -04:00
Vincent Donnefort
180e4e3909 tracing: Add snapshot refcount
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users will help to know if a mapping is possible or not.

Instead of relying on the global trace_types_lock, a new spinlock is
introduced to serialize accesses to trace_array->snapshot. This intends
to allow access to that variable in a context where the mmap lock is
already held.

Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-4-vdonnefort@google.com

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:12:47 -04:00
Steven Rostedt (Google)
b70f293824 ring-buffer: Make wake once of ring_buffer_wait() more robust
The default behavior of ring_buffer_wait() when passed a NULL "cond"
parameter is to exit the function the first time it is woken up. The
current implementation uses a counter that starts at zero and when it is
greater than one it exits the wait_event_interruptible().

But this relies on the internal working of wait_event_interruptible() as
that code basically has:

  if (cond)
    return;
  prepare_to_wait();
  if (!cond)
    schedule();
  finish_wait();

That is, cond is called twice before it sleeps. The default cond of
ring_buffer_wait() needs to account for that and wait for its counter to
increment twice before exiting.

Instead, use the seq/atomic_inc logic that is used by the tracing code
that calls this function. Add an atomic_t seq to rb_irq_work and when cond
is NULL, have the default callback take a descriptor as its data that
holds the rbwork and the value of the seq when it started.

The wakeups will now increment the rbwork->seq and the cond callback will
simply check if that number is different, and no longer have to rely on
the implementation of wait_event_interruptible().

Link: https://lore.kernel.org/linux-trace-kernel/20240315063115.6cb5d205@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 7af9ded0c2 ("ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-18 10:11:43 -04:00
Linus Torvalds
8048ba24e1 Fix timer migration bug that can result in long bootup
delays and other oddities.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmX2sR4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gGYA/+PLcIFIaaoAtZProEYYwKBqf/G59RWCNF
 hP5ytKDy/59NlIP7VhRWXYUbxzZkiFpvNduOdnUaU7gYKiQGk2kpuD6EM479t8og
 H4nSd8/izuS995kBwuthpISssxggoWXYU30fA6JRIJU7Imfwjcapvlv6Nrj2+NEj
 vsw4HNAtXgj+e8ZSDmjzw00zsSaJhIQx539mGYNvMF3geGzDdRuM2XXHjXj27WSO
 /BPS+oFhR0IYIf+FUVCo5w57SgQamdL2rAobZq6y89visqnvOfej4g1Je/mgUjO/
 I/dcoxhLFp3Ebp68c3l4KbH+3f9LheLU+6xCdMMxOUbVyZEtziuFBqXjsh10b3ey
 n4hYnGq35vhIYYRrlQsBYuPegn9MECb0A1rOvj2HwWM9IfpWSZ+6thrj78ZT0SEw
 cxxFqGBWLFwq6wRIwEQDriN8Rbc5FgybPHqRcmJjDwtlRFFJzs6u78IViZWHpq18
 oYxLzlfhOrjhKG/0W2yVpMqNgZdolO1pX3Xz/5mF7JnnlcOibOC5GMIo8AvehxsN
 zCI6QhKs8ZcfHbY2QSZQnn3g07VbG+3IbaA2pH1ZgWAVEPOqRqZthKFCoBzsno+u
 if2dqLoOMmPrW4N4u+o6sAu1WYNUyI1bEGKsl6u/gXawlcCmbjbgr5OjQ53ehW+N
 mNFUKOIX4Pk=
 =MhsC
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix timer migration bug that can result in long bootup delays and
  other oddities"

* tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer/migration: Remove buggy early return on deactivation
2024-03-17 12:19:02 -07:00
linke li
f1e30cb636 ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
In function ring_buffer_iter_empty(), cpu_buffer->commit_page is read
while other threads may change it. It may cause the time_stamp that read
in the next line come from a different page. Use READ_ONCE() to avoid
having to reason about compiler optimizations now and in future.

Link: https://lore.kernel.org/linux-trace-kernel/tencent_DFF7D3561A0686B5E8FC079150A02505180A@qq.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: linke li <lilinke99@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17 07:58:53 -04:00
Vincent Donnefort
6b76323e5a ring-buffer: Zero ring-buffer sub-buffers
In preparation for the ring-buffer memory mapping where each subbuf will
be accessible to user-space, zero all the page allocations.

Link: https://lore.kernel.org/linux-trace-kernel/20240220202310.2489614-2-vdonnefort@google.com

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17 07:58:53 -04:00
Steven Rostedt (Google)
2cc621fd2e tracing: Move saved_cmdline code into trace_sched_switch.c
The code that handles saved_cmdlines is split between the trace.c file and
the trace_sched_switch.c. There's some history to this. The
trace_sched_switch.c was originally created to handle the sched_switch
tracer that was deprecated due to sched_switch trace event making it
obsolete. But that file did not get deleted as it had some code to help
with saved_cmdlines. But trace.c has grown tremendously since then. Just
move all the saved_cmdlines code into trace_sched_switch.c as that's the
only reason that file still exists, and trace.c has gotten too big.

No functional changes.

Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.497966629@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17 07:58:53 -04:00
Steven Rostedt (Google)
e85d471c2b tracing: Move open coded processing of tgid_map into helper function
In preparation of moving the saved_cmdlines logic out of trace.c and into
trace_sched_switch.c, replace the open coded manipulation of tgid_map in
set_tracer_flag() into a helper function trace_alloc_tgid_map() so that it
can be easily moved into trace_sched_switch.c without changing existing
functions in trace.c.

No functional changes.

Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.338116216@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-17 07:58:52 -04:00