Commit graph

41213 commits

Author SHA1 Message Date
Nicholas Piggin
001c28e571 exit: Detect and fix irq disabled state in oops
If a task oopses with irqs disabled, this can cause various cascading
problems in the oops path such as sleep-from-invalid warnings, and
potentially worse.

Since commit 0258b5fd7c ("coredump: Limit coredumps to a single
thread group"), the unconditional irq enable in coredump_task_exit()
will "fix" the irq state to be enabled early in do_exit(), so currently
this may not be triggerable, but that is coincidental and fragile.

Detect and fix the irqs_disabled() condition in the oops path before
calling do_exit(), similarly to the way in_atomic() is handled.
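
A rough sketch of the idea (illustrative only, not the exact patch), mirroring
how the existing in_atomic() case is handled in the oops path:

        /* in the oops path, before calling do_exit() */
        if (unlikely(irqs_disabled())) {
                pr_emerg("Oops with IRQs disabled, enabling them before exit\n");
                local_irq_enable();
        }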

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/lkml/20221004094401.708299-1-npiggin@gmail.com/
2023-01-21 00:06:10 +01:00
Jakub Kicinski
b3c588cd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ipa/ipa_interrupt.c
drivers/net/ipa/ipa_interrupt.h
  9ec9b2a308 ("net: ipa: disable ipa interrupt during suspend")
  8e461e1f09 ("net: ipa: introduce ipa_interrupt_enable()")
  d50ed35587 ("net: ipa: enable IPA interrupt handlers separate from registration")
https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/
https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20 12:28:23 -08:00
Linus Torvalds
5deaa98587 Including fixes from wireless, bluetooth, bpf and netfilter.

Merge tag 'net-6.2-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless, bluetooth, bpf and netfilter.

  Current release - regressions:

   - Revert "net: team: use IFF_NO_ADDRCONF flag to prevent ipv6
     addrconf", fix nsna_ping mode of team

   - wifi: mt76: fix bugs in Rx queue handling and DMA mapping

   - eth: mlx5:
      - add missing mutex_unlock in error reporter
      - protect global IPsec ASO with a lock

  Current release - new code bugs:

   - rxrpc: fix wrong error return in rxrpc_connect_call()

  Previous releases - regressions:

   - bluetooth: hci_sync: fix use of HCI_OP_LE_READ_BUFFER_SIZE_V2

   - wifi:
      - mac80211: fix crashes on Rx due to incorrect initialization of
        rx->link and rx->link_sta
      - mac80211: fix bugs in iTXQ conversion - Tx stalls, incorrect
        aggregation handling, crashes
      - brcmfmac: fix regression for Broadcom PCIe wifi devices
      - rndis_wlan: prevent buffer overflow in rndis_query_oid

   - netfilter: conntrack: handle tcp challenge acks during connection
     reuse

   - sched: avoid grafting on htb_destroy_class_offload when destroying

   - virtio-net: correctly enable callback during start_xmit, fix stalls

   - tcp: avoid the lookup process failing to get sk in ehash table

   - ipa: disable ipa interrupt during suspend

   - eth: stmmac: enable all safety features by default

  Previous releases - always broken:

   - bpf:
      - fix pointer-leak due to insufficient speculative store bypass
        mitigation (Spectre v4)
      - skip task with pid=1 in send_signal_common() to avoid a splat
      - fix BPF program ID information in BPF_AUDIT_UNLOAD as well as
        PERF_BPF_EVENT_PROG_UNLOAD events
      - fix potential deadlock in htab_lock_bucket from same bucket
        index but different map_locked index

   - bluetooth:
      - fix a buffer overflow in mgmt_mesh_add()
      - hci_qca: fix driver shutdown on closed serdev
      - ISO: fix possible circular locking dependency
      - CIS: hci_event: fix invalid wait context

   - wifi: brcmfmac: fixes for survey dump handling

   - mptcp: explicitly specify sock family at subflow creation time

   - netfilter: nft_payload: incorrect arithmetics when fetching VLAN
     header bits

   - tcp: fix rate_app_limited to default to 1

   - l2tp: close all race conditions in l2tp_tunnel_register()

   - eth: mlx5: fixes for QoS config and eswitch configuration

   - eth: enetc: avoid deadlock in enetc_tx_onestep_tstamp()

   - eth: stmmac: fix invalid call to mdiobus_get_phy()

  Misc:

   - ethtool: add netlink attr in rss get reply only if the value is not
     empty"

* tag 'net-6.2-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  Revert "Merge branch 'octeontx2-af-CPT'"
  tcp: fix rate_app_limited to default to 1
  bnxt: Do not read past the end of test names
  net: stmmac: enable all safety features by default
  octeontx2-af: add mbox to return CPT_AF_FLT_INT info
  octeontx2-af: update cpt lf alloc mailbox
  octeontx2-af: restore rxc conf after teardown sequence
  octeontx2-af: optimize cpt pf identification
  octeontx2-af: modify FLR sequence for CPT
  octeontx2-af: add mbox for CPT LF reset
  octeontx2-af: recover CPT engine when it gets fault
  net: dsa: microchip: ksz9477: port map correction in ALU table entry register
  selftests/net: toeplitz: fix race on tpacket_v3 block close
  net/ulp: use consistent error code when blocking ULP
  octeontx2-pf: Fix the use of GFP_KERNEL in atomic context on rt
  tcp: avoid the lookup process failing to get sk in ehash table
  Revert "net: team: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf"
  MAINTAINERS: add networking entries for Willem
  net: sched: gred: prevent races when adding offloads to stats
  l2tp: prevent lockdep issue in l2tp_tunnel_register()
  ...
2023-01-20 09:58:44 -08:00
Paul E. McKenney
52e0452b41 PM: sleep: Remove "select SRCU"
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it.  Therefore, remove the "select SRCU"
Kconfig statements.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-20 17:53:07 +01:00
Jiri Olsa
74bc3a5acc bpf: Add missing btf_put to register_btf_id_dtor_kfuncs
We take the BTF reference before we register dtors and we need
to put it back when it's done.

We probably won't see a problem with kernel BTF, but module BTF
would stay loaded (because of the extra ref) even when its module
is removed.

Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 5ce937d613 ("bpf: Populate pairs of btf_id and destructor kfunc in btf")
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230120122148.1522359-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-20 08:19:56 -08:00
Randy Dunlap
6b37dfcb39 PM: hibernate: swap: don't use /** for non-kernel-doc comments
kernel-doc complains about multiple occurrences of "/**" being used
for something that is not a kernel-doc comment, so change all of these
to just use "/*" comment style.

The warning message for all of these is:

FILE:LINE: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

kernel/power/swap.c:585: warning: ...
Structure used for CRC32.
kernel/power/swap.c:600: warning: ...
 * CRC32 update function that runs in its own thread.
kernel/power/swap.c:627: warning: ...
 * Structure used for LZO data compression.
kernel/power/swap.c:644: warning: ...
 * Compression function that runs in its own thread.
kernel/power/swap.c:952: warning: ...
 *      The following functions allow us to read data using a swap map
kernel/power/swap.c:1111: warning: ...
 * Structure used for LZO data decompression.
kernel/power/swap.c:1127: warning: ...
 * Decompression function that runs in its own thread.

Also correct one spello/typo.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-20 14:49:48 +01:00
Thomas Weißschuh
00142bfd5a kernels/ksysfs.c: export kernel address bits
This can be used by userspace to determine the address size of the
running kernel.
It frees userspace from having to interpret this information from the
UTS machine field.

Userspace implementation:
https://github.com/util-linux/util-linux/pull/1966
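
For illustration, a small user-space reader (sketch; the sysfs path
/sys/kernel/address_bits is an assumption based on this patch description):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/kernel/address_bits", "r");
        int bits;

        if (!f) {
                perror("fopen");
                return 1;
        }
        if (fscanf(f, "%d", &bits) == 1)
                printf("kernel address size: %d bits\n", bits);
        fclose(f);
        return 0;
}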

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20221221-address-bits-v1-1-8446b13244ac@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-20 14:30:45 +01:00
Jiri Olsa
6a5f2d6ee8 bpf: Change modules resolving for kprobe multi link
We currently use module_kallsyms_on_each_symbol(), which iterates over all
modules/symbols, and we try to look up each such address in the
user-provided symbols/addresses to get the list of used modules.

This fix instead iterates only the provided kprobe addresses and calls
__module_address() on each to get the list of used modules. This turned
out to be simpler and also a bit faster.
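
Roughly, the new approach looks like this (hand-written sketch, not the patch
itself; add_module() stands in for collecting the result):

        unsigned int i;

        /* resolve each user-supplied kprobe address to its owning module */
        for (i = 0; i < addrs_cnt; i++) {
                struct module *mod;

                preempt_disable();
                mod = __module_address(addrs[i]);
                if (mod && !try_module_get(mod))
                        mod = NULL;
                preempt_enable();

                if (mod)
                        add_module(&mods, mod);
        }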

On my setup with workload (executed 10 times):

   # test_progs -t kprobe_multi_bench_attach/modules

Current code:

 Performance counter stats for './test.sh' (5 runs):

    76,081,161,596      cycles:k                   ( +-  0.47% )

           18.3867 +- 0.0992 seconds time elapsed  ( +-  0.54% )

With the fix:

 Performance counter stats for './test.sh' (5 runs):

    74,079,889,063      cycles:k                   ( +-  0.04% )

           17.8514 +- 0.0218 seconds time elapsed  ( +-  0.12% )

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230116101009.23694-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-19 17:07:40 -08:00
Zhen Lei
07cc2c931e livepatch: Improve the search performance of module_kallsyms_on_each_symbol()
Currently we traverse all symbols of all modules to find the specified
function for the specified module. But in reality, we just need to find
the given module and then traverse all the symbols in it.

Let's add a new parameter 'const char *modname' to function
module_kallsyms_on_each_symbol(), then we can compare the module names
directly in this function and call hook 'fn' after matching. If 'modname'
is NULL, the symbols of all modules are still traversed for compatibility
with other usage cases.

Phase1: mod1-->mod2..(subsequent modules do not need to be compared)
                |
Phase2:          -->f1-->f2-->f3

Assuming that there are m modules, each module has n symbols on average,
then the time complexity is reduced from O(m * n) to O(m) + O(n).
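
A hypothetical caller after this change could look like the sketch below
(the callback signature follows the pre-existing API; struct find_arg and
match_symbol() are illustrative names):

static int match_symbol(void *data, const char *name,
                        struct module *mod, unsigned long addr)
{
        struct find_arg *arg = data;

        if (strcmp(name, arg->name))
                return 0;
        arg->addr = addr;
        return 1;       /* non-zero return stops the iteration */
}

        /* walk only the symbols of "ext4"; passing NULL as modname keeps
         * the old behaviour of walking every module */
        module_kallsyms_on_each_symbol("ext4", match_symbol, &arg);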

Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230116101009.23694-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-19 17:07:15 -08:00
Eduard Zingerman
71f656a501 bpf: Fix to preserve reg parent/live fields when copying range info
Register range information is copied in several places. The intent is
to transfer range/id information from one register/stack spill to
another. Currently this is done using direct register assignment, e.g.:

static void find_equal_scalars(..., struct bpf_reg_state *known_reg)
{
	...
	struct bpf_reg_state *reg;
	...
			*reg = *known_reg;
	...
}

However, such assignments also copy the following bpf_reg_state fields:

struct bpf_reg_state {
	...
	struct bpf_reg_state *parent;
	...
	enum bpf_reg_liveness live;
	...
};

Copying of these fields is accidental and incorrect, as could be
demonstrated by the following example:

     0: call ktime_get_ns()
     1: r6 = r0
     2: call ktime_get_ns()
     3: r7 = r0
     4: if r0 > r6 goto +1             ; r0 & r6 are unbound thus generated
                                       ; branch states are identical
     5: *(u64 *)(r10 - 8) = 0xdeadbeef ; 64-bit write to fp[-8]
    --- checkpoint ---
     6: r1 = 42                        ; r1 marked as written
     7: *(u8 *)(r10 - 8) = r1          ; 8-bit write, fp[-8] parent & live
                                       ; overwritten
     8: r2 = *(u64 *)(r10 - 8)
     9: r0 = 0
    10: exit

This example is unsafe because 64-bit write to fp[-8] at (5) is
conditional, thus not all bytes of fp[-8] are guaranteed to be set
when it is read at (8). However, currently the example passes
verification.

First, the execution path 1-10 is examined by verifier.
Suppose that a new checkpoint is created by is_state_visited() at (6).
After checkpoint creation:
- r1.parent points to checkpoint.r1,
- fp[-8].parent points to checkpoint.fp[-8].
At (6) the r1.live is set to REG_LIVE_WRITTEN.
At (7) the fp[-8].parent is set to r1.parent and fp[-8].live is set to
REG_LIVE_WRITTEN, because of the following code called in
check_stack_write_fixed_off():

static void save_register_state(struct bpf_func_state *state,
				int spi, struct bpf_reg_state *reg,
				int size)
{
	...
	state->stack[spi].spilled_ptr = *reg;  // <--- parent & live copied
	if (size == BPF_REG_SIZE)
		state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN;
	...
}

Note the intent is to mark a stack spill as written only if 8 bytes are
spilled to a slot; however, this intent is spoiled by the 'live' field copy.
At (8) the checkpoint.fp[-8] should be marked as REG_LIVE_READ but
this does not happen:
- fp[-8] in a current state is already marked as REG_LIVE_WRITTEN;
- fp[-8].parent points to checkpoint.r1, parentage chain is used by
  mark_reg_read() to mark checkpoint states.
At (10) the verification is finished for path 1-10 and jump 4-6 is
examined. The checkpoint.fp[-8] never gets REG_LIVE_READ mark and this
spill is pruned from the cached states by clean_live_states(). Hence
verifier state obtained via path 1-4,6 is deemed identical to one
obtained via path 1-6 and program marked as safe.

Note: the example should be executed with BPF_F_TEST_STATE_FREQ flag
set to force creation of intermediate verifier states.

This commit revisits the locations where bpf_reg_state instances are
copied and replaces the direct copies with a call to a function
copy_register_state(dst, src) that preserves 'parent' and 'live'
fields of the 'dst'.
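
Based on the description above, the helper essentially amounts to (sketch):

static void copy_register_state(struct bpf_reg_state *dst,
                                const struct bpf_reg_state *src)
{
        struct bpf_reg_state *parent = dst->parent;
        enum bpf_reg_liveness live = dst->live;

        *dst = *src;
        dst->parent = parent;   /* keep the destination's parentage chain */
        dst->live = live;       /* keep the destination's liveness marks */
}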

Fixes: 679c782de1 ("bpf/verifier: per-register parent pointers")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230106142214.1040390-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-19 15:19:23 -08:00
Linus Torvalds
d368967cb1 Printk fixes for 6.2-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmPJTNMACgkQUqAMR0iA
 lPL3tBAApmvatVLQfjT10AnpXv/RhNqeL3JbTHCOU7EiG9OpmoNrXzJA+GAyhyXz
 FAViF2Xyfrl3XtYsJUe0hVQNUuGm2tz2owqg7OFuTcvcs9AzGhHK7CCoJv+f580P
 MyojbXFaBw/eqk5LMB5lqR6rhjiz5UeGTXL32StdqBBV9KlZTpwt36Rwc59i0Pkn
 uk4QSp2bJQIgHkSZt8pyspRoR2QLZhntVHvAbvqjIsRWVyqkIFvPc7djKVxPfJBM
 4tf9ijEHkbLL1cFjhFvbstjsEuUu0G3lmgNa/0zID2tXbywQXKTfg+sa7AOWCn8o
 EN9bbTdwGSkfbOnH+l9cmmljeSPPyMCskEFvm7c5yaFR1DZHj6G1OjklQnG6Anjc
 l2GD5kYYi2QTRCI6YMtjb7851i/nY4oAOmCMbzDtWlzn4Kxs0vWdXHoaw/KcoBgh
 W7z2crC7k7+N6DUAkEJz0ivIQJtrASCV9X9DyYC/Y03VahFNsNVA3Le8mJGeWZhk
 rAR+M8sp+UCBByWOPHIUT0UknfyGt4MczmtS9eftsbRwVNUPRlagJrOahJCGooKU
 WmXvu8q9in/+h2X3JhLe4thTQd72XmvbKWXYJthYtMDZMPwkezTYFnDi3n21bPV7
 o+pl80LJG0F1KJHLHt2tO+9ul8f1mvuIyG5MXOMkT0CQCBDNQQQ=
 =aYZ6
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fixes from Petr Mladek:

 - Prevent a potential deadlock when configuring kgdb console

 - Fix a kernel doc warning

* tag 'printk-for-6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  kernel/printk/printk.c: Fix W=1 kernel-doc warning
  tty: serial: kgdboc: fix mutex locking order for configure_kgdboc()
2023-01-19 12:32:07 -08:00
Petr Mladek
21493c6e96 Merge branch 'rework/console-list-lock' into for-linus 2023-01-19 14:56:38 +01:00
Mike Leach
7d30d480a6 kernel: events: Export perf_report_aux_output_id()
CoreSight trace is being updated to use perf_report_aux_output_id()
in a similar way to intel-pt.

This function needs to be exported so that it can be called from
loadable kernel modules, which CoreSight may be configured to be built as.
Signed-off-by: Mike Leach <mike.leach@linaro.org>
Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230116124928.5440-12-mike.leach@linaro.org
2023-01-19 10:16:47 +00:00
Christian Brauner
e67fe63341 fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap.
Remove legacy file_mnt_user_ns() and mnt_user_ns().

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner
9452e93e6d fs: port privilege checking helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner
f2d40141d5 fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
39f60c1cce fs: port xattr to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
4609e1f18e fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
c54bd91e9e fs: port ->mkdir() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
7a77db9551 fs: port ->symlink() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevant on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Yonghong Song
bdb7fdb0ac bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers
In the current bpf_send_signal() and bpf_send_signal_thread() helper
implementations, irq_work is used to handle NMI context. Hao Sun
reported in [1] that the current task at the entry of the helper
might be gone during irq_work callback processing. To fix the issue,
a reference is acquired for the current task before enqueuing into
the irq_work so that the queued task is still available during
irq_work callback processing.

  [1] https://lore.kernel.org/bpf/20230109074425.12556-1-sunhao.th@gmail.com/
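
Conceptually the fix pins the task across the deferral (simplified sketch;
field names are illustrative):

        /* enqueue side (may run in NMI context): pin the task first */
        work->task = get_task_struct(current);
        irq_work_queue(&work->irq_work);

        /* irq_work callback: use the pinned task, then drop the reference */
        group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, work->type);
        put_task_struct(work->task);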

Fixes: 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
Tested-by: Hao Sun <sunhao.th@gmail.com>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230118204815.3331855-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-18 18:44:16 -08:00
Hou Tao
36024d023d bpf: Fix off-by-one error in bpf_mem_cache_idx()
According to the definition of sizes[NUM_CACHES], when the size passed
to bpf_mem_cache_size() is 256, it should return 6 instead of 7.

Fixes: 7c8199e24f ("bpf: Introduce any context BPF specific memory allocator.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-18 18:36:26 -08:00
Jeff Xu
105ff5339f mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allow an application to set
the executable bit at creation time (memfd_create()).

When MFD_NOEXEC_SEAL is set, the memfd is created without the executable
bit (mode: 0666) and sealed with F_SEAL_EXEC, so it can't be made
executable (mode: 0777) with chmod after creation.

When the MFD_EXEC flag is set, the memfd is created with the executable
bit (mode: 0777); this is the same as the old behavior of memfd_create().

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
        MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
        MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create() for old software that
doesn't set the executable bit; for example, in a container with
vm.memfd_noexec=1, such old software will create non-executable memfds by
default.  Also, the value of memfd_noexec is passed to child namespaces
at creation time.  For example, if the init namespace has
vm.memfd_noexec=2, all its child namespaces will be created with 2.
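
For illustration, a minimal user-space use of the new flag (sketch; the flag
value below is an assumption taken from this series, use the uapi header once
available):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U         /* assumed value from this series */
#endif

int main(void)
{
        /* created with mode 0666 and sealed with F_SEAL_EXEC */
        int fd = syscall(SYS_memfd_create, "demo", MFD_NOEXEC_SEAL);

        if (fd < 0) {
                perror("memfd_create");
                return 1;
        }
        printf("non-executable memfd created: fd=%d\n", fd);
        return 0;
}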

[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@google.com>
Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:37 -08:00
Namhyung Kim
0eed282205 perf/core: Call perf_prepare_sample() before running BPF
As BPF can access sample data, the data needs to be populated first.  Also
remove the logic that gets the callchain specifically, as it's now covered
by perf_prepare_sample().

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-9-namhyung@kernel.org
2023-01-18 11:57:21 +01:00
Namhyung Kim
f6e707156e perf/core: Introduce perf_prepare_header()
Factor out perf_prepare_header() so that it can call
perf_prepare_sample() without a header if not needed.

Also it checks the filtered_sample_type to avoid duplicate
work when perf_prepare_sample() is called twice (or more).

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
a7c8d0daa8 perf/core: Do not pass header for sample ID init
The only thing it does for header in __perf_event_header__init_id() is
to update the header size with event->id_header_size.  We can do this
outside and get rid of the argument for the later change.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-7-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
bb447c27a4 perf/core: Set data->sample_flags in perf_prepare_sample()
The perf_prepare_sample() function sets the perf_sample_data according
to the attr->sample_type before copying it to the ring buffer.  But BPF
also wants to access the sample data so it needs to prepare the sample
even before the regular path.

That means perf_prepare_sample() can be called more than once.  Set
the data->sample_flags consistently so that it can indicate which fields
are already set and skip them if so.

Also update the filtered_sample_type to have the dependent flags to
reduce the number of branches.
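
The resulting pattern inside perf_prepare_sample() is roughly (sketch):

        /* filtered_sample_type only contains bits not already present in
         * data->sample_flags, so work done earlier (e.g. for BPF) is skipped */
        if (filtered_sample_type & PERF_SAMPLE_IP) {
                data->ip = perf_instruction_pointer(regs);
                data->sample_flags |= PERF_SAMPLE_IP;
        }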

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-6-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
eb55b455ef perf/core: Add perf_sample_save_brstack() helper
When we save the branch stack to the perf sample data, we need to
update the sample flags and the dynamic size.  To make sure this is
done consistently, add the perf_sample_save_brstack() helper and
convert all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-5-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
0a9081cf0a perf/core: Add perf_sample_save_raw_data() helper
When we save the raw_data to the perf sample data, we need to update
the sample flags and the dynamic size.  To make sure this is done
consistently, add the perf_sample_save_raw_data() helper and convert
all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Namhyung Kim
31046500c1 perf/core: Add perf_sample_save_callchain() helper
When we save the callchain to the perf sample data, we need to update
the sample flags and the dynamic size.  To ensure this is done consistently,
add the perf_sample_save_callchain() helper and convert all call sites.
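
Call sites then collapse to something like (sketch):

        /* stores the pointer, sets PERF_SAMPLE_CALLCHAIN in sample_flags and
         * accounts the entries into the dynamic sample size in one place */
        perf_sample_save_callchain(&data, event, callchain);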

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-3-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Namhyung Kim
4cf7a13611 perf/core: Save the dynamic parts of sample data size
The perf sample data can be divided into parts.  The event->header_size
and event->id_header_size keep the static part of the sample data which
is determined by the sample_type flags.

But other parts like CALLCHAIN and BRANCH_STACK change dynamically, so
the actual data needs to be examined.  In preparation for handling repeated
calls to perf_prepare_sample(), save the dynamic size in the perf sample
data to avoid duplicate work.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-2-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Petr Mladek
d551afc258 printk: Use scnprintf() to print the message about the dropped messages on a console
Use scnprintf() for printing the message about dropped messages on
a console. It returns the length actually written, which prevents a
potential buffer overflow when the returned length is later used to
copy the buffer content.

Note that the previous code was safe because the scratch buffer was
big enough and the message always fit in it. But scnprintf() is
definitely safer.
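
The difference in a nutshell (illustrative fragment; 'dropped' and 'outbuf'
stand for the console's counter and output buffer):

        char msg[64];
        size_t len;

        /* snprintf() would return the length the text would have needed,
         * possibly larger than the buffer; scnprintf() returns the number
         * of characters actually written, so the copy below cannot overflow */
        len = scnprintf(msg, sizeof(msg),
                        "** %lu printk messages dropped **\n", dropped);
        memcpy(outbuf, msg, len);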

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1530570 ("Memory - corruptions")
Fixes: c4fcc617e1 ("printk: introduce console_prepend_dropped() for dropped messages")
Link: https://lore.kernel.org/r/202301131544.D9E804CCD@keescook
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230117161031.15499-1-pmladek@suse.com
2023-01-18 10:13:50 +01:00
Thomas Weißschuh
13e1df0928 kheaders: explicitly validate existence of cpio command
If the cpio command is not available, the error emitted by
gen_kheaders.sh is not clear, as all output of the call to cpio is
discarded:

GNU make 4.4:

  GEN     kernel/kheaders_data.tar.xz
find: 'standard output': Broken pipe
find: write error
make[2]: *** [kernel/Makefile:157: kernel/kheaders_data.tar.xz] Error 127
make[1]: *** [scripts/Makefile.build:504: kernel] Error 2

GNU make < 4.4:

  GEN     kernel/kheaders_data.tar.xz
make[2]: *** [kernel/Makefile:157: kernel/kheaders_data.tar.xz] Error 127
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [scripts/Makefile.build:504: kernel] Error 2

Add an explicit check that will trigger a clear message about the issue:

  CHK     kernel/kheaders_data.tar.xz
./kernel/gen_kheaders.sh: line 17: type: cpio: not found

The other commands executed by gen_kheaders.sh are part of a standard
installation, so they are not checked.

Reported-by: Amy Parker <apark0006@student.cerritos.edu>
Link: https://lore.kernel.org/lkml/CAPOgqxFva=tOuh1UitCSN38+28q3BNXKq19rEsVNPRzRqKqZ+g@mail.gmail.com/
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-01-18 16:34:04 +09:00
Joel Fernandes (Google)
6efdda8bec rcu: Track laziness during boot and suspend
Boot and suspend/resume should not be slowed down in kernels built with
CONFIG_RCU_LAZY=y.  In particular, suspend can sometimes fail in such
kernels.

This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and
rcu_async_should_hurry() functions that track whether or not either
a boot or a suspend/resume operation is in progress.  This will
enable a later commit to refrain from laziness during those times.

Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax()
for later use by rcutorture.

[ paulmck: Apply feedback from Steve Rostedt. ]

Fixes: 3cb278e73b ("rcu: Make call_rcu() lazy to save power")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-17 20:20:11 -08:00
Jakub Kicinski
423c1d363c bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY8WHEgAKCRDbK58LschI
 g+gDAP9K24FwikbCHy145i/KHFaIk4ZDSIfjff8uyKDq73h9QwEAvBvrxko7d+dh
 EHdhJGoqufV8n5wilYOrOGN7ShMwFAg=
 =aFQl
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
bpf 2023-01-16

We've added 6 non-merge commits during the last 8 day(s) which contain
a total of 6 files changed, 22 insertions(+), 24 deletions(-).

The main changes are:

1) Mitigate a Spectre v4 leak in unprivileged BPF from speculative
   pointer-as-scalar type confusion, from Luis Gerhorst.

2) Fix a splat when pid 1 attaches a BPF program that attempts to
   send killing signal to itself, from Hao Sun.

3) Fix BPF program ID information in BPF_AUDIT_UNLOAD as well as
   PERF_BPF_EVENT_PROG_UNLOAD events, from Paul Moore.

4) Fix BPF verifier warning triggered from invalid kfunc call in
   backtrack_insn, also from Hao Sun.

5) Fix potential deadlock in htab_lock_bucket from same bucket index
   but different map_locked index, from Tonghao Zhang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix pointer-leak due to insufficient speculative store bypass mitigation
  bpf: hash map, avoid deadlock with suitable hash mask
  bpf: remove the do_idr_lock parameter from bpf_prog_free_id()
  bpf: restore the ebpf program ID for BPF_AUDIT_UNLOAD and PERF_BPF_EVENT_PROG_UNLOAD
  bpf: Skip task with pid=1 in send_signal_common()
  bpf: Skip invalid kfunc call in backtrack_insn
====================

Link: https://lore.kernel.org/r/20230116230745.21742-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-17 19:13:02 -08:00
Jiri Olsa
700e6f853e bpf: Do not allow to load sleepable BPF_TRACE_RAW_TP program
Currently we allow any tracing program to be loaded as sleepable,
but BPF_TRACE_RAW_TP can't sleep. Make the check explicit for
tracing program attach types, so a sleepable BPF_TRACE_RAW_TP
will fail to load.

Also update the verifier error to mention iter programs.
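
The check is along these lines (sketch; the allowed list follows the
verifier's existing rules for sleepable tracing programs):

        if (prog->aux->sleepable) {
                switch (prog->expected_attach_type) {
                case BPF_TRACE_FENTRY:
                case BPF_TRACE_FEXIT:
                case BPF_MODIFY_RETURN:
                case BPF_TRACE_ITER:
                        break;  /* these tracing attach types may sleep */
                default:
                        /* e.g. BPF_TRACE_RAW_TP: refuse to load as sleepable */
                        return -EINVAL;
                }
        }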

Acked-by: Song Liu <song@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230117223705.440975-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-17 16:56:04 -08:00
Jason Gunthorpe
ac8f29aef2 genirq/msi: Free the fwnode created by msi_create_device_irq_domain()
msi_create_device_irq_domain() creates a firmware node for the new domain,
which is never freed. kmemleak reports:

unreferenced object 0xffff888120ba9a00 (size 96):
  comm "systemd-modules", pid 221, jiffies 4294893411 (age 635.732s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 e0 19 8b 83 ff ff ff ff  ................
    00 00 00 00 00 00 00 00 18 9a ba 20 81 88 ff ff  ........... ....
  backtrace:
    [<000000008cdbc98d>] __irq_domain_alloc_fwnode+0x51/0x2b0
    [<00000000c57acf9d>] msi_create_device_irq_domain+0x283/0x670
    [<000000009b567982>] __pci_enable_msix_range+0x49e/0xdb0
    [<0000000077cc1445>] pci_alloc_irq_vectors_affinity+0x11f/0x1c0
    [<00000000532e9ef5>] mlx5_irq_table_create+0x24c/0x940 [mlx5_core]
    [<00000000fabd2b80>] mlx5_load+0x1fa/0x680 [mlx5_core]
    [<000000006bb22ae4>] mlx5_init_one+0x485/0x670 [mlx5_core]
    [<00000000eaa5e1ad>] probe_one+0x4c2/0x720 [mlx5_core]
    [<00000000df8efb43>] local_pci_probe+0xd6/0x170
    [<0000000085cb9924>] pci_device_probe+0x231/0x6e0

Use the proper free operation for the firmware node so the name is freed
during error unwind of msi_create_device_irq_domain(), and also free the
node in msi_remove_device_irq_domain() if it was automatically allocated.

To avoid extra NULL pointer checks make irq_domain_free_fwnode() tolerant
of NULL.

Fixes: 27a6dea3eb ("genirq/msi: Provide msi_create/free_device_irq_domain()")
Reported-by: Omri Barazi <obarazi@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kalle Valo <kvalo@kernel.org>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/0-v2-24af6665e2da+c9-msi_leak_jgg@nvidia.com
2023-01-17 22:57:04 +01:00
Ming Lei
f7b3ea8cf7 genirq/affinity: Move group_cpus_evenly() into lib/
group_cpus_evenly() has become a generic function which can be used by
subsystems other than the interrupt subsystem, so move it into lib/.
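
Usage outside the irq core then becomes (sketch; assign_queue_cpus() is an
illustrative consumer):

        struct cpumask *masks;
        unsigned int i;

        /* spread all CPUs evenly (NUMA/locality aware) over nr_queues groups */
        masks = group_cpus_evenly(nr_queues);
        if (!masks)
                return -ENOMEM;
        for (i = 0; i < nr_queues; i++)
                assign_queue_cpus(i, &masks[i]);
        kfree(masks);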

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>                                                                                                                                                                                                    
Link: https://lore.kernel.org/r/20221227022905.352674-6-ming.lei@redhat.com
2023-01-17 18:50:06 +01:00
Ming Lei
523f1ea76a genirq/affinity: Rename irq_build_affinity_masks as group_cpus_evenly
Map irq vectors into groups, which allows abstracting the algorithm for
a generic use case outside of the interrupt core.

Rename irq_build_affinity_masks() to group_cpus_evenly(), so the API can
be reused by blk-mq to build the default queue mapping even though irq
vectors aren't involved.

No functional change, just rename vector as group.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>                                                                                                                                                                                                    
Link: https://lore.kernel.org/r/20221227022905.352674-5-ming.lei@redhat.com
2023-01-17 18:50:06 +01:00
Ming Lei
e7bdd7f0cb genirq/affinity: Don't pass irq_affinity_desc array to irq_build_affinity_masks
Prepare for abstracting irq_build_affinity_masks() into a public function
for assigning all CPUs evenly into several groups.

Don't pass irq_affinity_desc array to irq_build_affinity_masks, instead
return a cpumask array by storing each assigned group into one element of
the array.

This allows providing a generic interface for grouping all CPUs evenly
from a NUMA and CPU locality viewpoint; the cost is one extra allocation
in irq_build_affinity_masks(), which should be fine since it is done via
GFP_KERNEL and irq_build_affinity_masks() is a slow path anyway.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>                                                                                                                                                                                                    
Link: https://lore.kernel.org/r/20221227022905.352674-4-ming.lei@redhat.com
2023-01-17 18:50:06 +01:00
Ming Lei
1f962d91a1 genirq/affinity: Pass affinity managed mask array to irq_build_affinity_masks
Pass affinity managed mask array to irq_build_affinity_masks() so that the
index of the first affinity managed vector is always zero.

This allows the implementation to be simplified a bit.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>                                                                                                                                                                                                    
Link: https://lore.kernel.org/r/20221227022905.352674-3-ming.lei@redhat.com
2023-01-17 18:50:06 +01:00
Ming Lei
cdf07f0ea4 genirq/affinity: Remove the 'firstvec' parameter from irq_build_affinity_masks
The 'firstvec' parameter is always the same as the 'startvec' parameter,
so use 'startvec' directly inside irq_build_affinity_masks().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>                                                                                                                                                                                                    
Link: https://lore.kernel.org/r/20221227022905.352674-2-ming.lei@redhat.com
2023-01-17 18:50:06 +01:00
Anuradha Weeraman
4fe59a130c kernel/printk/printk.c: Fix W=1 kernel-doc warning
Fix W=1 kernel-doc warning:

kernel/printk/printk.c:
 - Include function parameter in console_lock_spinning_disable_and_check()

Signed-off-by: Anuradha Weeraman <anuradha@debian.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230116125635.374567-1-anuradha@debian.org
2023-01-16 16:59:17 +01:00
John Ogness
3ef5abd9b5 tty: serial: kgdboc: fix mutex locking order for configure_kgdboc()
Several mutexes are taken while setting up console serial ports. In
particular, the tty_port->mutex and @console_mutex are taken:

  serial_pnp_probe
    serial8250_register_8250_port
      uart_add_one_port (locks tty_port->mutex)
        uart_configure_port
          register_console (locks @console_mutex)

In order to synchronize kgdb's tty_find_polling_driver() with
register_console(), commit 6193bc9084 ("tty: serial: kgdboc:
synchronize tty_find_polling_driver() and register_console()") takes
the @console_mutex. However, this leads to the following call chain
(with locking):

  platform_probe
    kgdboc_probe
      configure_kgdboc (locks @console_mutex)
        tty_find_polling_driver
          uart_poll_init (locks tty_port->mutex)
            uart_set_options

This is clearly a potential deadlock due to the reverse lock ordering.

Since uart_set_options() requires holding @console_mutex in order to
serialize early initialization of the serial-console lock, take the
@console_mutex in uart_poll_init() instead of configure_kgdboc().

Since configure_kgdboc() was using @console_mutex for safe traversal
of the console list, change it to use the SRCU iterator instead.
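
The traversal then takes roughly this form (sketch):

        int cookie;

        cookie = console_srcu_read_lock();
        for_each_console_srcu(con) {
                /* inspect con under SRCU instead of holding @console_mutex */
        }
        console_srcu_read_unlock(cookie);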

Add comments to uart_set_options() kerneldoc mentioning that it
requires holding @console_mutex (aka the console_list_lock).

Fixes: 6193bc9084 ("tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console()")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
[pmladek@suse.com: Export console_srcu_read_lock_is_held() to fix build kgdboc as a module.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230112161213.1434854-1-john.ogness@linutronix.de
2023-01-16 16:44:53 +01:00
Waiman Long
5657c11678 sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
The kernel commit 9a5418bc48 ("sched/core: Use kfree_rcu() in
do_set_cpus_allowed()") introduces a bug for kernels built with non-SMP
configs. Calling sched_setaffinity() on such a uniprocessor kernel will
cause cpumask_copy() to be called with a NULL pointer, leading to a
general protection fault. This is not really a problem in real use cases as
there aren't that many uniprocessor kernel configs in use and calling
sched_setaffinity() on such a uniprocessor system doesn't make sense.

Fix this problem by making sure cpumask_copy() will not be called in
such a case.
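
The guard amounts to something like (sketch):

        struct cpumask *user_mask;

        user_mask = alloc_user_cpus_ptr(NUMA_NO_NODE);
        if (user_mask) {
                cpumask_copy(user_mask, in_mask);
        } else if (IS_ENABLED(CONFIG_SMP)) {
                /* allocation failure is only an error on SMP; on !SMP,
                 * alloc_user_cpus_ptr() legitimately returns NULL */
                return -ENOMEM;
        }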

Fixes: 9a5418bc48 ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()")
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230115193122.563036-1-longman@redhat.com
2023-01-16 10:07:25 +01:00
Vincent Guittot
79ba1e607d sched/fair: Limit sched slice duration
In the presence of a lot of small-weight tasks like sched_idle tasks,
normal or high-weight tasks can see their ideal runtime (sched_slice)
increase to hundreds of ms, whereas it normally stays below
sysctl_sched_latency.

2 normal tasks running on a CPU will have a max sched_slice of 12ms
(half of the sched_period). This means that they will make progress
every sysctl_sched_latency period.

If we now add 1000 idle tasks on the CPU, the sched_period becomes
3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
It will even become 1500 ms if the idle tasks belong to an idle cgroup.
This means that the scheduler will only look to pick another waiting task
after 609 ms (respectively 1500 ms) of running time. The idle tasks
significantly change the way the 2 normal tasks interleave their running
time slots, whereas they should only have a small impact.

Such long sched_slice can delay significantly the release of resources
as the tasks can wait hundreds of ms before the next running slot just
because of idle tasks queued on the rq.

Cap the ideal_runtime to sysctl_sched_latency to make sure that tasks will
regularly make progress and will not be significantly impacted by
idle/background tasks queued on the rq.
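
The change effectively clamps the slice in the tick preemption check (sketch):

        ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr),
                              sysctl_sched_latency);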

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230113133613.257342-1-vincent.guittot@linaro.org
2023-01-15 09:59:00 +01:00
Linus Torvalds
8b7be52f3f modules-6.2-rc4
Just one fix for modules by Nick.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmPB5S4SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoin0OUQALx+uch7cZvl3aznH4CAkF1U2mKuESLT
 unQsnSlxIF0BETTAdPbhcCVEtKi102Xq4R6ffCSilB33b0IGM6qY6ZUQb/GMGhg5
 qIur7EDiFksjCBmBuE1iNaAknkiKeirkJTEC4lEOl8OLXprmnuMeeR+E0BDi5sH3
 YIxexLuhyKeF9Ke4bjAJ7A4WKvh2R7yamISCUcP6GL3w7U9urSjo9FHKcUmFH9nr
 qCX/1zE0fz7iJzb9YtBNVhdgKNOYNOxa7TDNXQCVuLcZQfmQqDMBVgKdOkj6TAWX
 6L2CHT4N9IjPt9+DrlEUf6bSSwP4N4aFdyMAo6UlVXcgvEbTS/kdoMSqFOAQSAdb
 G/lzsvS3fD76VcCdZkwAFYnEhUTJ4xWTS0oaI++tu0EFX5lvRuQv2DRt0WULlD/u
 L0paUwmjtVajcSIATxRZkjoMiVD4btDRz30kaIUU/xoc1Gg/EADrSLHESaZ9eZVL
 EJ40aqLLIRBXGZrVEzvf97HIzuQiKfaPzywNvbMpxG3m0tV2pn3Z4ts/A8aO7c+O
 mBDnTURiZN6pT+xsnJBvqWrlXwPRUGwI+NjRcdPZhUyfgj5MHpEI8PAcUWy6TTUn
 H2P6x2iC3/nypqhnwjoixSptjaUWcak3R6UgwVS2YqfjePCqaq0wg9DAFDQH0Yx3
 awAOoum0Ubin
 =aa3U
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull module fix from Luis Chamberlain:
 "Just one fix for modules by Nick"

* tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  kallsyms: Fix scheduling with interrupts disabled in self-test
2023-01-14 08:17:27 -06:00
Randy Dunlap
0fb0624b15 seccomp: fix kernel-doc function name warning
Move the ACTION_ONLY() macro so that it is not between the kernel-doc
notation and the function definition for seccomp_run_filters(),
eliminating a kernel-doc warning:

kernel/seccomp.c:400: warning: expecting prototype for seccomp_run_filters(). Prototype was for ACTION_ONLY() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230108021228.15975-1-rdunlap@infradead.org
2023-01-13 17:01:06 -08:00
Nicholas Piggin
da35048f26 kallsyms: Fix scheduling with interrupts disabled in self-test
kallsyms_on_each*() may schedule, so it must not be called with interrupts
disabled. The iteration function could disable interrupts, but this also
changes lookup_symbol() to match the change to the other timing code.

Reported-by: Erhard F. <erhard_f@mailbox.org>
Link: https://lore.kernel.org/all/bug-216902-206035@https.bugzilla.kernel.org%2F/
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202212251728.8d0872ff-oliver.sang@intel.com
Fixes: 30f3bb0977 ("kallsyms: Add self-test facility")
Tested-by: "Erhard F." <erhard_f@mailbox.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-01-13 15:09:08 -08:00
Valentin Schneider
c63a2e52d5 workqueue: Fold rebind_worker() within rebind_workers()
!CONFIG_SMP builds complain about rebind_worker() being unused. Its only
user, rebind_workers(), is indeed only defined for CONFIG_SMP, so just fold
the two lines back up there.

Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-13 07:50:40 -10:00