1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

44766 commits

Author SHA1 Message Date
Andy Shevchenko
a2ea3cd783 irqdomain: Check virq for 0 before use in irq_dispose_mapping()
It's a bit hard to read the logic since the virq is used before checking it
for 0. Rearrange the code to make it better to understand.

This, in particular, should clearly answer the question whether the caller
needs to perform this check or not, and there are plenty of places for both
variants, confirming a confusion.

Fun fact that the new code is shorter:

  Function                                     old     new   delta
  irq_dispose_mapping                          278     271      -7
  Total: Before=11625, After=11618, chg -0.06%

when compiled by GCC on Debian for x86_64.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240405190105.3932034-1-andriy.shevchenko@linux.intel.com
2024-04-08 12:08:58 +02:00
Linus Torvalds
3520c35e5f Fix various timer bugs:
- Fix a timer migration bug that may result in missed events
  - Fix timer migration group hierarchy event updates
  - Fix a PowerPC64 build warning
  - Fix a handful of DocBook annotation bugs
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSUpsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h+/RAAlbYzlotBMM0cqxCng5jgetTT7EfQHXl1
 zaqhx2FzEjoyhZ++kpBP03A42LumWz0TXTqRK+BicZIHWvIWz16w7xNr0dHo3+L8
 PfPTZEPb1IwSP1FKHyzEZbVWPnHtokyJBky5Qp5IG5FoNqV1pArqeadyaSbd3hIw
 A3l77wHCtXINkxjROs5EoJiOwVcJWigm4M7189EXDUKKr5nzE0hemNAKGnluQZxj
 O5gF9vv40B38MLuo3xLDxFCrY8WDcq9yhv/AtBk+952FsceSZbH29zOt1a5l2HPb
 yvBR4pMaS6x4UdzJeZTbdqDs8v9QWsCUc+qqeNYuFEJSBu9y7Qo5wec8c+Ptiu0E
 1we/g4nWRaRnXvGyS1uj448jUZgnGu61KFbCCF+guDl94zKY6TBZfVpeWrF/Xjdr
 Jq1K8zYMM/+hxlzqsVhoaL+2zAddUeWnwPcSC5J8mnVlyLJUd55Cd0OGcHimz3PV
 QcimajOcE7e/pkw0eQnRQ6qAVeWXcJY4hWoJS9Nk8F9InfDC7I8T5NgsNVb6Edyx
 fj2wE/K9lAfKevz49ieJ8ItIIus3Lzmi09pbfDmDP5J9iMyL6UMk2VXj8XAUvCdL
 qpgigP1zcluwAFqHmaym6mUsej+VL/WqsKfy6Q8LI5yNvdYtUuzfQuqGqyOyGXX0
 zJg6+qU7OAE=
 =4VkW
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Fix various timer bugs:

   - Fix a timer migration bug that may result in missed events

   - Fix timer migration group hierarchy event updates

   - Fix a PowerPC64 build warning

   - Fix a handful of DocBook annotation bugs"

* tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Return early on deactivation
  timers/migration: Fix ignored event due to missing CPU update
  vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h
  timers: Fix text inconsistencies and spelling
  tick/sched: Fix struct tick_sched doc warnings
  tick/sched: Fix various kernel-doc warnings
  timers: Fix kernel-doc format and add Return values
  time/timekeeping: Fix kernel-doc warnings and typos
  time/timecounter: Fix inline documentation
2024-04-07 09:20:50 -07:00
David Vernet
a8e03b6bbb bpf: Allow invoking kfuncs from BPF_PROG_TYPE_SYSCALL progs
Currently, a set of core BPF kfuncs (e.g. bpf_task_*, bpf_cgroup_*,
bpf_cpumask_*, etc) cannot be invoked from BPF_PROG_TYPE_SYSCALL
programs. The whitelist approach taken for enabling kfuncs makes sense:
it not safe to call these kfuncs from every program type. For example,
it may not be safe to call bpf_task_acquire() in an fentry to
free_task().

BPF_PROG_TYPE_SYSCALL, on the other hand, is a perfectly safe program
type from which to invoke these kfuncs, as it's a very controlled
environment, and we should never be able to run into any of the typical
problems such as recursive invoations, acquiring references on freeing
kptrs, etc. Being able to invoke these kfuncs would be useful, as
BPF_PROG_TYPE_SYSCALL can be invoked with BPF_PROG_RUN, and would
therefore enable user space programs to synchronously call into BPF to
manipulate these kptrs.

This patch therefore enables invoking the aforementioned core kfuncs
from BPF_PROG_TYPE_SYSCALL progs.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240405143041.632519-2-void@manifault.com
2024-04-05 10:56:09 -07:00
Philo Lu
9d482da9e1 bpf: allow invoking bpf_for_each_map_elem with different maps
Taking different maps within a single bpf_for_each_map_elem call is not
allowed before, because from the second map,
bpf_insn_aux_data->map_ptr_state will be marked as *poison*. In fact
both map_ptr and state are needed to support this use case: map_ptr is
used by set_map_elem_callback_state() while poison state is needed to
determine whether to use direct call.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-3-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-05 10:31:17 -07:00
Philo Lu
0a525621b7 bpf: store both map ptr and state in bpf_insn_aux_data
Currently, bpf_insn_aux_data->map_ptr_state is used to store either
map_ptr or its poison state (i.e., BPF_MAP_PTR_POISON). Thus
BPF_MAP_PTR_POISON must be checked before reading map_ptr. In certain
cases, we may need valid map_ptr even in case of poison state.
This will be explained in next patch with bpf_for_each_map_elem()
helper.

This patch changes map_ptr_state into a new struct including both map
pointer and its state (poison/unpriv). It's in the same union with
struct bpf_loop_inline_state, so there is no extra memory overhead.
Besides, macros BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR are no
longer needed.

This patch does not change any existing functionality.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-2-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-05 10:31:17 -07:00
Arnd Bergmann
58babe2718 bpf: fix perf_snapshot_branch_stack link failure
The newly added code to handle bpf_get_branch_snapshot fails to link when
CONFIG_PERF_EVENTS is disabled:

aarch64-linux-ld: kernel/bpf/verifier.o: in function `do_misc_fixups':
verifier.c:(.text+0x1090c): undefined reference to `__SCK__perf_snapshot_branch_stack'

Add a build-time check for that Kconfig symbol around the code to
remove the link time dependency.

Fixes: 314a53623c ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240405142637.577046-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-05 08:39:15 -07:00
Anna-Maria Behnsen
7a96a84bfb timers/migration: Return early on deactivation
Commit 4b6f4c5a67 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.

But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i, T1        nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :

  lock(GRP0:0->lock);
  timerqueue_del(GRP0:0, T0i);
  unlock(GRP0:0->lock);


                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:0, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):

  lock(GRP0:0->lock);
  lock(GRP1:0->lock);
  evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
					in GRP0:0
  ... update GRP1:0 group event and timerqueue ...
  unlock(GRP1:0->lock);
  unlock(GRP0:0->lock);

So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.

Revert commit 4b6f4c5a67 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.

Fixes: 4b6f4c5a67 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus
2024-04-05 11:05:16 +02:00
Frederic Weisbecker
61f7fdf8fd timers/migration: Fix ignored event due to missing CPU update
When a group event is updated with its expiry unchanged but a different
CPU, that target change may go unnoticed and the event may be propagated
up with a stale CPU value. The following depicts a scenario that has
been actually observed:

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = TGRP1:0 (T0)
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T0
      /         \
    0 (T0)       1 (T1)
    idle         idle

0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle,
including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have
the same expiry value.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T0
      /         \
    0 (T0)       1 (T1)
    idle         idle

1) The migrator in GRP1:1 handles remotely T0. The event is dequeued
from the top and T0 executed.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds
none. But it updates the events from its groups, starting with GRP0:0
which now has T1 as its next event. So far so good.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

3) The migrator in GRP1:1 proceeds upward and updates the events in
GRP1:0. The child event TGRP0:0 is found queued with the same expiry
as before. And therefore it is left unchanged. However the target CPU
is not the same but that fact is ignored so TGRP0:0 still points to
CPU 0 when it should point to CPU 1.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = TGRP1:0 (T0)
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

4) The propagation has reached the top level and TGRP1:0, having TGRP0:0
as its first event, also wrongly points to CPU 0. TGRP1:0 is added to
the top level group.

                       [GRP2:0]
                   migrator = GRP1:1
                   active   = GRP1:1
                   nextevt  = KTIME_MAX
                    /              \
               [GRP1:0]           [GRP1:1]
            migrator = NONE       [...]
            active   = NONE
            nextevt  = TGRP0:0 (T0)
            /           \
        [GRP0:0]       [...]
      migrator = NONE
      active   = NONE
      nextevt  = T1
      /         \
    0            1 (T1)
    idle         idle

5) The migrator in GRP1:1 dequeues the next event in top level pointing
to CPU 0. But since it actually doesn't see any real event in CPU 0, it
early returns.

6) T1 is left unhandled until either CPU 0 or CPU 1 wake up.

Some other bad scenario may involve trees with just two levels.

Fix this with unconditionally updating the CPU of the child event before
considering to early return while updating a queued event with an
unchanged expiry value.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/Zg2Ct6M2RJAYHgCB@localhost.localdomain
2024-04-05 11:05:16 +02:00
Andrii Nakryiko
1f2a74b41e bpf: prevent r10 register from being marked as precise
r10 is a special register that is not under BPF program's control and is
always effectively precise. The rest of precision logic assumes that
only r0-r9 SCALAR registers are marked as precise, so prevent r10 from
being marked precise.

This can happen due to signed cast instruction allowing to do something
like `r0 = (s8)r10;`, which later, if r0 needs to be precise, would lead
to an attempt to mark r10 as precise.

Prevent this with an extra check during instruction backtracking.

Fixes: 8100928c88 ("bpf: Support new sign-extension mov insns")
Reported-by: syzbot+148110ee7cf72f39f33e@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404214536.3551295-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-04 18:31:08 -07:00
Jakub Kicinski
cf1ca1f66d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/ip_gre.c
  17af420545 ("erspan: make sure erspan_base_hdr is present in skb->head")
  5832c4a77d ("ip_tunnel: convert __be16 tunnel flags to bitmaps")
https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/

Adjacent changes:

net/ipv6/ip6_fib.c
  d21d40605b ("ipv6: Fix infinite recursion in fib6_dump_done().")
  5fc68320c1 ("ipv6: remove RTNL protection from inet6_dump_fib()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 18:01:07 -07:00
Linus Torvalds
c88b9b4cde Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
 accompanying one of the fixes is also becoming a common occurrence.
 
 Current release - regressions:
 
  - ipv6: fix infinite recursion in fib6_dump_done()
 
  - net/rds: fix possible null-deref in newly added error path
 
 Current release - new code bugs:
 
  - net: do not consume a full cacheline for system_page_pool
 
  - bpf: fix bpf_arena-related file descriptor leaks in the verifier
 
  - drv: ice: fix freeing uninitialized pointers, fixing misuse of
    the newfangled __free() auto-cleanup
 
 Previous releases - regressions:
 
  - x86/bpf: fixes the BPF JIT with retbleed=stuff
 
  - xen-netfront: add missing skb_mark_for_recycle, fix page pool
    accounting leaks, revealed by recently added explicit warning
 
  - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
    non-wildcard addresses
 
  - Bluetooth:
    - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
      with better workarounds to un-break some buggy Qualcomm devices
    - set conn encrypted before conn establishes, fix re-connecting
      to some headsets which use slightly unusual sequence of msgs
 
  - mptcp:
    - prevent BPF accessing lowat from a subflow socket
    - don't account accept() of non-MPC client as fallback to TCP
 
  - drv: mana: fix Rx DMA datasize and skb_over_panic
 
  - drv: i40e: fix VF MAC filter removal
 
 Previous releases - always broken:
 
  - gro: various fixes related to UDP tunnels - netns crossing problems,
    incorrect checksum conversions, and incorrect packet transformations
    which may lead to panics
 
  - bpf: support deferring bpf_link dealloc to after RCU grace period
 
  - nf_tables:
    - release batch on table validation from abort path
    - release mutex after nft_gc_seq_end from abort path
    - flush pending destroy work before exit_net release
 
  - drv: r8169: skip DASH fw status checks when DASH is disabled
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
 IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
 pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
 dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
 6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
 ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
 zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
 V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
 urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
 a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
 pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
 v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
 =6bQI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest
  accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:

   - ipv6: fix infinite recursion in fib6_dump_done()

   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:

   - net: do not consume a full cacheline for system_page_pool

   - bpf: fix bpf_arena-related file descriptor leaks in the verifier

   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the
     newfangled __free() auto-cleanup

  Previous releases - regressions:

   - x86/bpf: fixes the BPF JIT with retbleed=stuff

   - xen-netfront: add missing skb_mark_for_recycle, fix page pool
     accounting leaks, revealed by recently added explicit warning

   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
     non-wildcard addresses

   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
        better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to
        some headsets which use slightly unusual sequence of msgs

   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP

   - drv: mana: fix Rx DMA datasize and skb_over_panic

   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:

   - gro: various fixes related to UDP tunnels - netns crossing
     problems, incorrect checksum conversions, and incorrect packet
     transformations which may lead to panics

   - bpf: support deferring bpf_link dealloc to after RCU grace period

   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release

   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04 14:49:10 -07:00
Andrii Nakryiko
314a53623c bpf: inline bpf_get_branch_snapshot() helper
Inline bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into underlying callback of
perf_snapshot_branch_stack static call. This callback is set early
during kernel initialization and is never updated or reset, so it's ok
to fetch actual implementation using static_call_query() and call
directly into it.

This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-04 13:08:01 -07:00
Andrii Nakryiko
5e6a3c1ee6 bpf: make bpf_get_branch_snapshot() architecture-agnostic
perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
there is no reason for BPF subsystem to keep track of which
architectures do support LBR or not. E.g., it looks like ARM64 might soon
get support for BRBE ([0]), which (with proper integration) should be
possible to utilize using this BPF helper.

perf_snapshot_branch_stack static call will point to
__static_call_return0() by default, which just returns zero, which will
lead to -ENOENT, as expected. So no need to guard anything here.

  [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-04 13:08:01 -07:00
Amir Goldstein
230d97d39e fsnotify: create a wrapper fsnotify_find_inode_mark()
In preparation to passing an object pointer to fsnotify_find_mark(), add
a wrapper fsnotify_find_inode_mark() and use it where possible.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20240317184154.1200192-4-amir73il@gmail.com>
2024-04-04 16:24:16 +02:00
Alexei Starovoitov
af682b767a bpf: Optimize emit_mov_imm64().
Turned out that bpf prog callback addresses, bpf prog addresses
used in bpf_trampoline, and in other cases the 64-bit address
can be represented as sign extended 32-bit value.

According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
"Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25 for mov r32, imm32."
So use shorter encoding and faster instruction when possible.

Special care is needed in jit_subprogs(), since bpf_pseudo_func()
instruction cannot change its size during the last step of JIT.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/CAADnVQKFfpY-QZBrOU2CG8v2du8Lgyb7MNVmOZVK_yTyOdNbBA@mail.gmail.com
Link: https://lore.kernel.org/bpf/20240401233800.42737-1-alexei.starovoitov@gmail.com
2024-04-04 16:13:26 +02:00
Andrii Nakryiko
0b56e637f7 bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
Using new per-CPU BPF instruction, partially inline
bpf_map_lookup_elem() helper for per-CPU hashmap BPF map. Just like for
normal HASH map, we still generate a call into __htab_map_lookup_elem(),
but after that we resolve per-CPU element address using a new
instruction, saving on extra functions calls.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:56 -07:00
Andrii Nakryiko
db69718b8e bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
Using new per-CPU BPF instruction implement inlining for per-CPU ARRAY
map lookup helper, if BPF JIT support is present.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:56 -07:00
Andrii Nakryiko
1ae6921009 bpf: inline bpf_get_smp_processor_id() helper
If BPF JIT supports per-CPU MOV instruction, inline bpf_get_smp_processor_id()
to eliminate unnecessary function calls.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:56 -07:00
Andrii Nakryiko
7bdbf74463 bpf: add special internal-only MOV instruction to resolve per-CPU addrs
Add a new BPF instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF JITs.

We use a special BPF_MOV | BPF_ALU64 | BPF_X form with insn->off field
set to BPF_ADDR_PERCPU = -1. I used negative offset value to distinguish
them from positive ones used by user-exposed instructions.

Such instruction performs a resolution of a per-CPU offset stored in
a register to a valid kernel address which can be dereferenced. It is
useful in any use case where absolute address of a per-CPU data has to
be resolved (e.g., in inlining bpf_map_lookup_elem()).

BPF disassembler is also taught to recognize them to support dumping
final BPF assembly code (non-JIT'ed version).

Add arch-specific way for BPF JITs to mark support for this instructions.

This patch also adds support for these instructions in x86-64 BPF JIT.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-03 10:29:55 -07:00
Justin Stitt
2e114248e0 bpf: Replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

bpf sym names get looked up and compared/cleaned with various string
apis. This suggests they need to be NUL-terminated (strncpy() suggests
this but does not guarantee it).

|	static int compare_symbol_name(const char *name, char *namebuf)
|	{
|		cleanup_symbol_name(namebuf);
|		return strcmp(name, namebuf);
|	}

|	static void cleanup_symbol_name(char *s)
|	{
|		...
|		res = strstr(s, ".llvm.");
|		...
|	}

Use strscpy() as this method guarantees NUL-termination on the
destination buffer.

This patch also replaces two uses of strncpy() used in log.c. These are
simple replacements as postfix has been zero-initialized on the stack
and has source arguments with a size less than the destination's size.

Note that this patch uses the new 2-argument version of strscpy
introduced in commit e6584c3964 ("string: Allow 2-argument strscpy()").

Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/bpf/20240402-strncpy-kernel-bpf-core-c-v1-1-7cb07a426e78@google.com
2024-04-03 16:57:41 +02:00
Dexuan Cui
a1255ccab8 swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
Sometimes the readout of /sys/kernel/debug/swiotlb/io_tlb_used and
io_tlb_used_hiwater can be a huge number (e.g. 18446744073709551615),
which is actually a negative number if we use "%ld" to print the number.

When swiotlb_create_default_debugfs() is running from late_initcall,
mem->total_used may already be non-zero, because the storage driver
may have already started to perform I/O operations: if the storage
driver is built-in, its probe() callback is called before late_initcall.

swiotlb_create_debugfs_files() should not blindly set mem->total_used
and mem->used_hiwater to 0; actually it doesn't have to initialize the
fields at all, because the fields, as part of the global struct
io_tlb_default_mem, have been implicitly initialized to zero.

Also don't explicitly set mem->transient_nslabs to 0.

Fixes: 8b0977ecc8 ("swiotlb: track and report io_tlb_used high water marks in debugfs")
Fixes: 02e7656970 ("swiotlb: add debugfs to track swiotlb transient pool usage")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-04-02 17:08:09 +02:00
Michael Kelley
e8068f2d75 swiotlb: fix swiotlb_bounce() to do partial sync's correctly
In current code, swiotlb_bounce() may do partial sync's correctly in
some circumstances, but may incorrectly fail in other circumstances.
The failure cases require both of these to be true:

1) swiotlb_align_offset() returns a non-zero "offset" value
2) the tlb_addr of the partial sync area points into the first
"offset" bytes of the _second_ or subsequent swiotlb slot allocated
for the mapping

Code added in commit 868c9ddc18 ("swiotlb: add overflow checks
to swiotlb_bounce") attempts to WARN on the invalid case where
tlb_addr points into the first "offset" bytes of the _first_
allocated slot. But there's no way for swiotlb_bounce() to distinguish
the first slot from the second and subsequent slots, so the WARN
can be triggered incorrectly when #2 above is true.

Related, current code calculates an adjustment to the orig_addr stored
in the swiotlb slot. The adjustment compensates for the difference
in the tlb_addr used for the partial sync vs. the tlb_addr for the full
mapping. The adjustment is stored in the local variable tlb_offset.
But when #1 and #2 above are true, it's valid for this adjustment to
be negative. In such case the arithmetic to adjust orig_addr produces
the wrong result due to tlb_offset being declared as unsigned.

Fix these problems by removing the over-constraining validations added
in 868c9ddc18. Change the declaration of tlb_offset to be signed
instead of unsigned so the adjustment arithmetic works correctly.

Tested with a test-only hack to how swiotlb_tbl_map_single() calls
swiotlb_bounce(). Instead of calling swiotlb_bounce() just once
for the entire mapped area, do a loop with each iteration doing
only a 128 byte partial sync until the entire mapped area is
sync'ed. Then with swiotlb=force on the kernel boot line, run a
variety of raw disk writes followed by read and verification of
all bytes of the written data. The storage device has DMA
min_align_mask set, and the writes are done with a variety of
original buffer memory address alignments and overall buffer
sizes. For many of the combinations, current code triggers the
WARN statements, or the data verification fails. With the fixes,
no WARNs occur and all verifications pass.

Fixes: 5f89468e2f ("swiotlb: manipulate orig_addr when tlb_addr has offset")
Fixes: 868c9ddc18 ("swiotlb: add overflow checks to swiotlb_bounce")
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Dominique Martinet <dominique.martinet@atmark-techno.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-04-02 17:08:03 +02:00
Petr Tesarik
af133562d5 swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
Allow a buffer pre-padding of up to alloc_align_mask, even if it requires
allocating additional IO TLB slots.

If the allocation alignment is bigger than IO_TLB_SIZE and min_align_mask
covers any non-zero bits in the original address between IO_TLB_SIZE and
alloc_align_mask, these bits are not preserved in the swiotlb buffer
address.

To fix this case, increase the allocation size and use a larger offset
within the allocated buffer. As a result, extra padding slots may be
allocated before the mapping start address.

Leave orig_addr in these padding slots initialized to INVALID_PHYS_ADDR.
These slots do not correspond to any CPU buffer, so attempts to sync the
data should be ignored.

The padding slots should be automatically released when the buffer is
unmapped. However, swiotlb_tbl_unmap_single() takes only the address of the
DMA buffer slot, not the first padding slot. Save the number of padding
slots in struct io_tlb_slot and use it to adjust the slot index in
swiotlb_release_slots(), so all allocated slots are properly freed.

Fixes: 2fd4fa5d3fb5 ("swiotlb: Fix alignment checks when both allocation and DMA masks are present")
Link: https://lore.kernel.org/linux-iommu/20240311210507.217daf8b@meshulam.tesarici.cz/
Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-04-02 17:07:57 +02:00
Jose Fernandez
ce09cbdd98 bpf: Improve program stats run-time calculation
This patch improves the run-time calculation for program stats by
capturing the duration as soon as possible after the program returns.

Previously, the duration included u64_stats_t operations. While the
instrumentation overhead is part of the total time spent when stats are
enabled, distinguishing between the program's native execution time and
the time spent due to instrumentation is crucial for accurate
performance analysis.

By making this change, the patch facilitates more precise optimization
of BPF programs, enabling users to understand their performance in
environments without stats enabled.

I used a virtualized environment to measure the run-time over one minute
for a basic raw_tracepoint/sys_enter program, which just increments a
local counter. Although the virtualization introduced some performance
degradation that could affect the results, I observed approximately a
16% decrease in average run-time reported by stats with this change
(310 -> 260 nsec).

Signed-off-by: Jose Fernandez <josef@netflix.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402034010.25060-1-josef@netflix.com
2024-04-02 16:51:15 +02:00
Anton Protopopov
9dc182c58b bpf: Add a verbose message if map limit is reached
When more than 64 maps are used by a program and its subprograms the
verifier returns -E2BIG. Add a verbose message which highlights the
source of the error and also print the actual limit.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402073347.195920-1-aspsk@isovalent.com
2024-04-02 16:12:00 +02:00
Alexander Lobakin
7d8296b250 bitops: make BYTES_TO_BITS() treewide-available
Avoid open-coding that simple expression each time by moving
BYTES_TO_BITS() from the probes code to <linux/bitops.h> to export
it to the rest of the kernel.
Simplify the macro while at it. `BITS_PER_LONG / sizeof(long)` always
equals to %BITS_PER_BYTE, regardless of the target architecture.
Do the same for the tools ecosystem as well (incl. its version of
bitops.h). The previous implementation had its implicit type of long,
while the new one is int, so adjust the format literal accordingly in
the perf code.

Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01 10:49:27 +01:00
Randy Dunlap
9e643ab59d timers: Fix text inconsistencies and spelling
Fix some text for consistency: s/lvl/level/ in a comment and use
correct/full function names in comments.

Correct spelling errors as reported by codespell.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240331172652.14086-7-rdunlap@infradead.org
2024-04-01 10:36:35 +02:00
Randy Dunlap
ba6ad57b80 tick/sched: Fix struct tick_sched doc warnings
Fix kernel-doc warnings in struct tick_sched:

  tick-sched.h:103: warning: Function parameter or struct member 'idle_sleeptime_seq' not described in 'tick_sched'
  tick-sched.h:104: warning: Excess struct member 'nohz_mode' description in 'tick_sched'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240331172652.14086-6-rdunlap@infradead.org
2024-04-01 10:36:35 +02:00
Randy Dunlap
f29536bf17 tick/sched: Fix various kernel-doc warnings
Fix a slew of kernel-doc warnings in tick-sched.c:

  tick-sched.c:650: warning: Function parameter or struct member 'now' not described in 'tick_nohz_update_jiffies'
  tick-sched.c:741: warning: No description found for return value of 'get_cpu_idle_time_us'
  tick-sched.c:767: warning: No description found for return value of 'get_cpu_iowait_time_us'
  tick-sched.c:1210: warning: No description found for return value of 'tick_nohz_idle_got_tick'
  tick-sched.c:1228: warning: No description found for return value of 'tick_nohz_get_next_hrtimer'
  tick-sched.c:1243: warning: No description found for return value of 'tick_nohz_get_sleep_length'
  tick-sched.c:1282: warning: Function parameter or struct member 'cpu' not described in 'tick_nohz_get_idle_calls_cpu'
  tick-sched.c:1282: warning: No description found for return value of 'tick_nohz_get_idle_calls_cpu'
  tick-sched.c:1294: warning: No description found for return value of 'tick_nohz_get_idle_calls'
  tick-sched.c:1577: warning: Function parameter or struct member 'hrtimer' not described in 'tick_setup_sched_timer'
  tick-sched.c:1577: warning: Excess function parameter 'mode' description in 'tick_setup_sched_timer'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240331172652.14086-5-rdunlap@infradead.org
2024-04-01 10:36:35 +02:00
Linus Torvalds
7e40c2100c Kbuild fixes for v6.9
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
 
  - Fix unselectable choice entry in MIPS Kconfig, and forbid this
    structure
 
  - Remove unused include/asm-generic/export.h
 
  - Fix a NULL pointer dereference bug in modpost
 
  - Enable -Woverride-init warning consistently with W=1
 
  - Drop KCSAN flags from *.mod.c files
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmYJVK0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGf/sP/3GOk//cQGwPyWCgtCEUo6T4yyD7
 1m2TTR0JQk/lcohSFtYk0I20rhKRqU6yAMAERmyehI66D2QY7lhiYVc16ram5y04
 x0nWxd9IqerIlGJtaWePOvNqKdCw2EP9fS9NKz58rEDMGlsSf0Rd3NEdSsWoH8td
 dECtt8yCawENAMStb/rAfsnL6kn2JIhVMyqwo0RdQfiaVT5Zk6Qgpko0Oq0ncRP2
 qdNgHbvnJdKMy81FHSBAi0QEZOYvhFNX+E+6lFfWEsX6xT+wvXddCNQzJf/YV3Cw
 Klw1tGveV7UGzlZ4fsnFrv4V6g1KO2AD3342efdDo++ypBEBpImVODc+Rp0jE9Nk
 OgdOQRe2k9a5keH0LWY0ehvDbQlSbfNxk0wNtAfo5Kk5e41nHmHJBWCwGG+cXrjJ
 mPJjSrTpuNVSaGV0kt3EskHbDBeBmIIg+5QPbldmW2qcC88kWoavkyLD3WPFsg/a
 CAuR/HqH7MDfxzvsqTCjonlVcyDKX6aW66LrQ1NCtmphI4F8mdKp746CzGlziuIm
 gjYJL/UWVlx0VebMo8dwDpaHvez4/4s6xAJcyqtA+TS5HbrQWKQuwFkiv4iWQxNd
 MvyVdzgKhcMdoXhfFpUZ0LlFvHGefJ+Z6N1FQLoQJkTirt5aqRbEAjP0VXwQB4eH
 zYygkhvvtiH9/STu
 =tx+2
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Deduplicate Kconfig entries for CONFIG_CXL_PMU

 - Fix unselectable choice entry in MIPS Kconfig, and forbid this
   structure

 - Remove unused include/asm-generic/export.h

 - Fix a NULL pointer dereference bug in modpost

 - Enable -Woverride-init warning consistently with W=1

 - Drop KCSAN flags from *.mod.c files

* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: Fix typo HEIGTH to HEIGHT
  Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
  kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
  kbuild: make -Woverride-init warnings more consistent
  modpost: do not make find_tosym() return NULL
  export.h: remove include/asm-generic/export.h
  kconfig: do not reparent the menu inside a choice block
  MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
  cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
2024-03-31 11:23:51 -07:00
Linus Torvalds
5dad26235c - Fix an unused function warning on irqchip/irq-armada-370-xp
- Fix the IRQ sharing with pinctrl-amd and ACPI OSL
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYJSYMACgkQEsHwGGHe
 VUoWag//STcc2r8aw6PdAvtm0VlUqY1oTrvMK6jvJORXFYEFl+R9lMW7sg9jAsDo
 fCcmRWFLD1dLy+eU+QNW4WNeqBlfODHKWQsxxtP9gl93qxvRAF6gvWW06JURvIoL
 yqrJGj9P3X7WsccaQAhOPpDKg93yU+uXvhc3J3Edi0WXsdBZ2g6jsAFj35TbAPb/
 DRl+8uFO3lGzF64ygg2hUFBPYE1tAEApbHrTpahQBHwiyA3iCKg0djgI9fnIM5E0
 Q4mCtcs6JwkpRY5Z/QYLT/VJsrFmVrxRyBnPAWZ677x/QTfGZk/QKZ7kDkQuTh/k
 jmfUVeJ4hnWmBg72GwYuahqKLlPlGUeG3iePQtw2rAKpDCU7xC3QMxbKzrVTCYKU
 JG75uIzO7+0W0IBYLRAG+TYI/YVyK+UlbLV4AAu2QtEiqJ6pQ2ZLg9NcDtnJjNqU
 CdwVgOeRGLLvQnKvS3Slj3+Z1wGEIFDE/w0dl9B0YEYMKiZNPoiClE/7VEA3b4Rn
 Q8jsT9WoiiagFFwvkTpCofjvTgioOLXJ8h7adxREi9GSYQy2mxq8XsZQEViozM5W
 mBVPRd2PCS7xIRwTBMehiK6Zta56z9h7RYhhpW2oPYy38/tHyhvvlyso9DF+I9pD
 uQkOKfDDh26IRFBr0fMR8dD0Host8xDQS1h+hCwyi4iPX8sgB/A=
 =80JR
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix an unused function warning on irqchip/irq-armada-370-xp

 - Fix the IRQ sharing with pinctrl-amd and ACPI OSL

* tag 'irq_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/armada-370-xp: Suppress unused-function warning
  genirq: Introduce IRQF_COND_ONESHOT and use it in pinctrl-amd
2024-03-31 11:04:51 -07:00
Arnd Bergmann
c40845e319 kbuild: make -Woverride-init warnings more consistent
The -Woverride-init warn about code that may be intentional or not,
but the inintentional ones tend to be real bugs, so there is a bit of
disagreement on whether this warning option should be enabled by default
and we have multiple settings in scripts/Makefile.extrawarn as well as
individual subsystems.

Older versions of clang only supported -Wno-initializer-overrides with
the same meaning as gcc's -Woverride-init, though all supported versions
now work with both. Because of this difference, an earlier cleanup of
mine accidentally turned the clang warning off for W=1 builds and only
left it on for W=2, while it's still enabled for gcc with W=1.

There is also one driver that only turns the warning off for newer
versions of gcc but not other compilers, and some but not all the
Makefiles still use a cc-disable-warning conditional that is no
longer needed with supported compilers here.

Address all of the above by removing the special cases for clang
and always turning the warning off unconditionally where it got
in the way, using the syntax that is supported by both compilers.

Fixes: 2cd3271b7a ("kbuild: avoid duplicate warning options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-03-31 11:32:26 +09:00
Alexei Starovoitov
59f2f84117 bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.
syzbot reported the following lock sequence:
cpu 2:
  grabs timer_base lock
    spins on bpf_lpm lock

cpu 1:
  grab rcu krcp lock
    spins on timer_base lock

cpu 0:
  grab bpf_lpm lock
    spins on rcu krcp lock

bpf_lpm lock can be the same.
timer_base lock can also be the same due to timer migration.
but rcu krcp lock is always per-cpu, so it cannot be the same lock.
Hence it's a false positive.
To avoid lockdep complaining move kfree_rcu() after spin_unlock.

Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com
2024-03-29 11:10:41 -07:00
Anton Protopopov
6dae957c8e bpf: fix possible file descriptor leaks in verifier
The resolve_pseudo_ldimm64() function might have leaked file
descriptors when BPF_MAP_TYPE_ARENA was used in a program (some
error paths missed a corresponding fdput). Add missing fdputs.

v2:
  remove unrelated changes from the fix

Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240329071106.67968-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-29 09:19:55 -07:00
Ingo Molnar
4475cd8bfd sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an
unnecessary complication that only make the code harder to read and slower.

We only ever set them separately:

 thule:~/tip> git grep SG_OVER kernel/sched/
 kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
 kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
 kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
 kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
 kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
 kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
 kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
 kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
 kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
 kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);

And use them separately, which results in suboptimal code:

                /* update overload indicator if we are at root domain */
                set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);

                /* Update over-utilization (tipping point, U >= 0) indicator */
                set_rd_overutilized_status(env->dst_rq->rd,

Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
and its lower level functions, and change all of them to 'bool'.

Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com
2024-03-29 07:53:27 +01:00
Martin KaFai Lau
e8742081db bpf: Mark bpf prog stack with kmsan_unposion_memory in interpreter mode
syzbot reported uninit memory usages during map_{lookup,delete}_elem.

==========
BUG: KMSAN: uninit-value in __dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
BUG: KMSAN: uninit-value in dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
__dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
____bpf_map_lookup_elem kernel/bpf/helpers.c:42 [inline]
bpf_map_lookup_elem+0x5c/0x80 kernel/bpf/helpers.c:38
___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997
__bpf_prog_run256+0xb5/0xe0 kernel/bpf/core.c:2237
==========

The reproducer should be in the interpreter mode.

The C reproducer is trying to run the following bpf prog:

    0: (18) r0 = 0x0
    2: (18) r1 = map[id:49]
    4: (b7) r8 = 16777216
    5: (7b) *(u64 *)(r10 -8) = r8
    6: (bf) r2 = r10
    7: (07) r2 += -229
            ^^^^^^^^^^

    8: (b7) r3 = 8
    9: (b7) r4 = 0
   10: (85) call dev_map_lookup_elem#1543472
   11: (95) exit

It is due to the "void *key" (r2) passed to the helper. bpf allows uninit
stack memory access for bpf prog with the right privileges. This patch
uses kmsan_unpoison_memory() to mark the stack as initialized.

This should address different syzbot reports on the uninit "void *key"
argument during map_{lookup,delete}_elem.

Reported-by: syzbot+603bcd9b0bf1d94dbb9b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000f9ce6d061494e694@google.com/
Reported-by: syzbot+eb02dc7f03dce0ef39f3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000a5c69c06147c2238@google.com/
Reported-by: syzbot+b4e65ca24fd4d0c734c3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000ac56fb06143b6cfa@google.com/
Reported-by: syzbot+d2b113dc9fea5e1d2848@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000d69b206142d1ff7@google.com/
Reported-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000006f876b061478e878@google.com/
Tested-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240328185801.1843078-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 19:00:59 -07:00
Andrii Nakryiko
1a80dbcb2d bpf: support deferring bpf_link dealloc to after RCU grace period
BPF link for some program types is passed as a "context" which can be
used by those BPF programs to look up additional information. E.g., for
multi-kprobes and multi-uprobes, link is used to fetch BPF cookie values.

Because of this runtime dependency, when bpf_link refcnt drops to zero
there could still be active BPF programs running accessing link data.

This patch adds generic support to defer bpf_link dealloc callback to
after RCU GP, if requested. This is done by exposing two different
deallocation callbacks, one synchronous and one deferred. If deferred
one is provided, bpf_link_free() will schedule dealloc_deferred()
callback to happen after RCU GP.

BPF is using two flavors of RCU: "classic" non-sleepable one and RCU
tasks trace one. The latter is used when sleepable BPF programs are
used. bpf_link_free() accommodates that by checking underlying BPF
program's sleepable flag, and goes either through normal RCU GP only for
non-sleepable, or through RCU tasks trace GP *and* then normal RCU GP
(taking into account rcu_trace_implies_rcu_gp() optimization), if BPF
program is sleepable.

We use this for multi-kprobe and multi-uprobe links, which dereference
link during program run. We also preventively switch raw_tp link to use
deferred dealloc callback, as upcoming changes in bpf-next tree expose
raw_tp link data (specifically, cookie value) to BPF program at runtime
as well.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com
Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com
Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240328052426.3042617-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:47:45 -07:00
Andrii Nakryiko
e9c856cabe bpf: put uprobe link's path and task in release callback
There is no need to delay putting either path or task to deallocation
step. It can be done right after bpf_uprobe_unregister. Between release
and dealloc, there could be still some running BPF programs, but they
don't access either task or path, only data in link->uprobes, so it is
safe to do.

On the other hand, doing path_put() in dealloc callback makes this
dealloc sleepable because path_put() itself might sleep. Which is
problematic due to the need to call uprobe's dealloc through call_rcu(),
which is what is done in the next bug fix patch. So solve the problem by
releasing these resources early.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240328052426.3042617-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:47:45 -07:00
Yafang Shao
ee3bad033d bpf: Mitigate latency spikes associated with freeing non-preallocated htab
Following the recent upgrade of one of our BPF programs, we encountered
significant latency spikes affecting other applications running on the same
host. After thorough investigation, we identified that these spikes were
primarily caused by the prolonged duration required to free a
non-preallocated htab with approximately 2 million keys.

Notably, our kernel configuration lacks the presence of CONFIG_PREEMPT. In
scenarios where kernel execution extends excessively, other threads might
be starved of CPU time, resulting in latency issues across the system. To
mitigate this, we've adopted a proactive approach by incorporating
cond_resched() calls within the kernel code. This ensures that during
lengthy kernel operations, the scheduler is invoked periodically to provide
opportunities for other threads to execute.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240327032022.78391-1-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:31:40 -07:00
Andrii Nakryiko
3124591f68 bpf: add bpf_modify_return_test_tp() kfunc triggering tracepoint
Add a simple bpf_modify_return_test_tp() kfunc, available to all program
types, that is useful for various testing and benchmarking scenarios, as
it allows to trigger most tracing BPF program types from BPF side,
allowing to do complex testing and benchmarking scenarios.

It is also attachable to for fmod_ret programs, making it a good and
simple way to trigger fmod_ret program under test/benchmark.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:31:40 -07:00
Haiyue Wang
55fc888ded bpf,arena: Use helper sizeof_field in struct accessors
Use the well defined helper sizeof_field() to calculate the size of a
struct member, instead of doing custom calculations.

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Link: https://lore.kernel.org/r/20240327065334.8140-1-haiyue.wang@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:31:36 -07:00
Mykyta Yatsenko
786bf0e7e2 bpf: improve error message for unsupported helper
BPF verifier emits "unknown func" message when given BPF program type
does not support BPF helper. This message may be confusing for users, as
important context that helper is unknown only to current program type is
not provided.

This patch changes message to "program of this type cannot use helper "
and aligns dependent code in libbpf and tests. Any suggestions on
improving/changing this message are welcome.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20240325152210.377548-1-yatsenko@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-28 18:30:53 -07:00
Jakub Kicinski
5e47fbe5ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-28 17:25:57 -07:00
Linus Torvalds
50108c352d Including fixes from bpf, WiFi and netfilter.
Current release - regressions:
 
  - ipv6: fix address dump when IPv6 is disabled on an interface
 
 Current release - new code bugs:
 
  - bpf: temporarily disable atomic operations in BPF arena
 
  - nexthop: fix uninitialized variable in nla_put_nh_group_stats()
 
 Previous releases - regressions:
 
  - bpf: protect against int overflow for stack access size
 
  - hsr: fix the promiscuous mode in offload mode
 
  - wifi: don't always use FW dump trig
 
  - tls: adjust recv return with async crypto and failed copy to userspace
 
  - tcp: properly terminate timers for kernel sockets
 
  - ice: fix memory corruption bug with suspend and rebuild
 
  - at803x: fix kernel panic with at8031_probe
 
  - qeth: handle deferred cc1
 
 Previous releases - always broken:
 
  - bpf: fix bug in BPF_LDX_MEMSX
 
  - netfilter: reject table flag and netdev basechain updates
 
  - inet_defrag: prevent sk release while still in use
 
  - wifi: pick the version of SESSION_PROTECTION_NOTIF
 
  - wwan: t7xx: split 64bit accesses to fix alignment issues
 
  - mlxbf_gige: call request_irq() after NAPI initialized
 
  - hns3: fix kernel crash when devlink reload during pf initialization
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmYFezkSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkdAUP/3SYNsFNIkh0/jwQqO9qBLJfI4suFjYG
 +s8jOGdCiA7n7aSgzv/RgGZ7XNqOegW3mpPRHecVVZcDu5I9y9N4AOhTDQG84TM/
 65YatgWpiZJT74oVEpoA8zcnmb4CCGYdWAxJCQZUKXoLjMAMPWelU4ee6VwonxGy
 GJ97+a4AxTXGvmQTi3rz0HLrSHQaizA+D7YP7YD8JczkG7I7kcAIR+SUWVKLSuw0
 VJnbko7RPIe3vdFFlMFypPgpZASjnO0O8g60s+eruazarEpMZE2+RqPfyz0nEg+u
 IK3W9zRw7r0PMkKqk9PoSaRjsIaNqIZBJR2Smh2cLMIpEB4CUvEFLi7WAshIdyUC
 +LBN9um3Ep3vLYh4nyuU3FzAyqdsqEo6+ayJCTRKq91xv9LrLmIN16IQpAqaRikb
 LJAuiaASwIpyu1FxBuTv41mLEUKtpm7ooziomHTJ7KbtzSf4QevRMBtorrB5t7VH
 l4yvp9ymcwHE79q8nrak1JH1JI/kCT5ZEPSqcOU5UNKSf6INjWqUTJedqZdVa5wB
 WiSZBixAmsc7DgZzARWKotRkgBEDyGeeHwrNLo/2kS8rS+hUCf6mSafpTZiPI/kL
 e+SVh+9RA8elFIF3sBV0VPcyt35G+if8o1NG1/2OTDPvZEkIz21eJhJgGyxRMHCD
 cpVSRBkU+np3
 =HbtI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf, WiFi and netfilter.

  Current release - regressions:

   - ipv6: fix address dump when IPv6 is disabled on an interface

  Current release - new code bugs:

   - bpf: temporarily disable atomic operations in BPF arena

   - nexthop: fix uninitialized variable in nla_put_nh_group_stats()

  Previous releases - regressions:

   - bpf: protect against int overflow for stack access size

   - hsr: fix the promiscuous mode in offload mode

   - wifi: don't always use FW dump trig

   - tls: adjust recv return with async crypto and failed copy to
     userspace

   - tcp: properly terminate timers for kernel sockets

   - ice: fix memory corruption bug with suspend and rebuild

   - at803x: fix kernel panic with at8031_probe

   - qeth: handle deferred cc1

  Previous releases - always broken:

   - bpf: fix bug in BPF_LDX_MEMSX

   - netfilter: reject table flag and netdev basechain updates

   - inet_defrag: prevent sk release while still in use

   - wifi: pick the version of SESSION_PROTECTION_NOTIF

   - wwan: t7xx: split 64bit accesses to fix alignment issues

   - mlxbf_gige: call request_irq() after NAPI initialized

   - hns3: fix kernel crash when devlink reload during pf
     initialization"

* tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  inet: inet_defrag: prevent sk release while still in use
  Octeontx2-af: fix pause frame configuration in GMP mode
  net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips
  net: bcmasp: Remove phy_{suspend/resume}
  net: bcmasp: Bring up unimac after PHY link up
  net: phy: qcom: at803x: fix kernel panic with at8031_probe
  netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
  netfilter: nf_tables: skip netdev hook unregistration if table is dormant
  netfilter: nf_tables: reject table flag and netdev basechain updates
  netfilter: nf_tables: reject destroy command to remove basechain hooks
  bpf: update BPF LSM designated reviewer list
  bpf: Protect against int overflow for stack access size
  bpf: Check bloom filter map value size
  bpf: fix warning for crash_kexec
  selftests: netdevsim: set test timeout to 10 minutes
  net: wan: framer: Add missing static inline qualifiers
  mlxbf_gige: call request_irq() after NAPI initialized
  tls: get psock ref after taking rxlock to avoid leak
  selftests: tls: add test with a partially invalid iov
  tls: adjust recv return with async crypto and failed copy to userspace
  ...
2024-03-28 13:09:37 -07:00
Ingo Molnar
4d0a63e5b8 sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
The _status() postfix has no real meaning, simplify the naming
and harmonize it with set_rd_overloaded().

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28 11:54:42 +01:00
Ingo Molnar
7bda10ba7f sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
Follow the rename of the root_domain::overloaded flag.

Note that this also matches the SG_OVERUTILIZED flag better.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28 11:44:44 +01:00
Ingo Molnar
76cc4f9148 sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
Follow the rename of the root_domain::overloaded flag.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28 11:41:31 +01:00
Ingo Molnar
dfb83ef7b8 sched/fair: Rename root_domain::overload to ::overloaded
It is silly to use an ambiguous noun instead of a clear adjective when naming
such a flag ...

Note how root_domain::overutilized already used a proper adjective.

rd->overloaded is now set to 1 when the root domain is overloaded.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
2024-03-28 11:38:58 +01:00
Shrikanth Hegde
caac629172 sched/fair: Use helper functions to access root_domain::overload
Introduce two helper functions to access & set the root_domain::overload flag:

  get_rd_overload()
  set_rd_overload()

To make sure code is always following READ_ONCE()/WRITE_ONCE() access methods.

No change in functionality intended.

[ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ]

Suggested-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com
2024-03-28 11:30:13 +01:00
Shrikanth Hegde
c628db0a68 sched/fair: Check root_domain::overload value before update
The root_domain::overload flag is 1 when there's any rq
in the root domain that has 2 or more running tasks. (Ie. it's overloaded.)

The root_domain structure itself is a global structure per cpuset island.

The ::overload flag is maintained the following way:

  - Set when adding a second task to the runqueue.

  - It is cleared in update_sd_lb_stats() during load balance,
    if none of the rqs have 2 or more running tasks.

This flag is used during newidle balance to see if its worth doing a full
load balance pass, which can be an expensive operation. If it is set,
then newidle balance will try to aggressively pull a task.

Since commit:

  630246a06a ("sched/fair: Clean-up update_sg_lb_stats parameters")

::overload is being written unconditionally, even if it has the same
value. The change in value of this depends on the workload, but on
typical workloads, it doesn't change all that often: a system is
either dominantly overloaded for substantial amounts of time, or not.

Extra writes to this semi-global structure cause unnecessary overhead, extra
bus traffic, etc. - so avoid it as much as possible.

Perf probe stats show that it's worth making this change (numbers are
with patch applied):

	1M    probe:sched_balance_newidle_L38
	139   probe:update_sd_lb_stats_L53     <====== 1->0 writes
	129K  probe:add_nr_running_L12
	74    probe:add_nr_running_L13         <====== 0->1 writes
	54K   probe:update_sd_lb_stats_L50     <====== reads

These numbers prove that actual change in the ::overload value is (much) less
frequent: L50 is much larger at ~54,000 accesses vs L53+L13 of 139+74.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20240325054505.201995-2-sshegde@linux.ibm.com
2024-03-28 11:30:13 +01:00