1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

2331 commits

Author SHA1 Message Date
Frederic Weisbecker
77e45145e3 net: Handle napi_schedule() calls from non-interrupt
napi_schedule() is expected to be called either:

* From an interrupt, where raised softirqs are handled on IRQ exit

* From a softirq disabled section, where raised softirqs are handled on
  the next call to local_bh_enable().

* From a softirq handler, where raised softirqs are handled on the next
  round in do_softirq(), or further deferred to a dedicated kthread.

Other bare tasks context may end up ignoring the raised NET_RX vector
until the next random softirq handling opportunity, which may not
happen before a while if the CPU goes idle afterwards with the tick
stopped.

Such "misuses" have been detected on several places thanks to messages
of the kind:

	"NOHZ tick-stop error: local softirq work is pending, handler #08!!!"

For example:

       __raise_softirq_irqoff
        __napi_schedule
        rtl8152_runtime_resume.isra.0
        rtl8152_resume
        usb_resume_interface.isra.0
        usb_resume_both
        __rpm_callback
        rpm_callback
        rpm_resume
        __pm_runtime_resume
        usb_autoresume_device
        usb_remote_wakeup
        hub_event
        process_one_work
        worker_thread
        kthread
        ret_from_fork
        ret_from_fork_asm

And also:

* drivers/net/usb/r8152.c::rtl_work_func_t
* drivers/net/netdevsim/netdev.c::nsim_start_xmit

There is a long history of issues of this kind:

	019edd01d1 ("ath10k: sdio: Add missing BH locking around napi_schdule()")
	3300685893 ("idpf: disable local BH when scheduling napi for marker packets")
	e3d5d70cb4 ("net: lan78xx: fix "softirq work is pending" error")
	e55c27ed9c ("mt76: mt7615: add missing bh-disable around rx napi schedule")
	c0182aa985 ("mt76: mt7915: add missing bh-disable around tx napi enable/schedule")
	970be1dff2 ("mt76: disable BH around napi_schedule() calls")
	019edd01d1 ("ath10k: sdio: Add missing BH locking around napi_schdule()")
	30bfec4fec ("can: rx-offload: can_rx_offload_threaded_irq_finish(): add new  function to be called from threaded interrupt")
	e63052a5dd ("mlx5e: add add missing BH locking around napi_schdule()")
	83a0c6e589 ("i40e: Invoke softirqs after napi_reschedule")
	bd4ce941c8 ("mlx4: Invoke softirqs after napi_reschedule")
	8cf699ec84 ("mlx4: do not call napi_schedule() without care")
	ec13ee8014 ("virtio_net: invoke softirqs after __napi_schedule")

This shows that relying on the caller to arrange a proper context for
the softirqs to be handled while calling napi_schedule() is very fragile
and error prone. Also fixing them can also prove challenging if the
caller may be called from different kinds of contexts.

Therefore fix this from napi_schedule() itself with waking up ksoftirqd
when softirqs are raised from task contexts.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Francois Romieu <romieu@fr.zoreil.com>
Closes: https://lore.kernel.org/lkml/354a2690-9bbf-4ccb-8769-fa94707a9340@molgen.mpg.de/
Cc: Breno Leitao <leitao@debian.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250223221708.27130-1-frederic@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 18:56:55 -08:00
Kuniyuki Iwashima
01c9c123db net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().
Breno Leitao reported the splat below. [0]

Commit 65161fb544 ("net: Fix dev_net(dev) race in
unregister_netdevice_notifier_dev_net().") added the
DEBUG_NET_WARN_ON_ONCE(), assuming that the netdev is not
registered before register_netdevice_notifier_dev_net().

But the assumption was simply wrong.

Let's use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().

[0]:
WARNING: CPU: 25 PID: 849 at net/core/dev.c:2150 register_netdevice_notifier_dev_net (net/core/dev.c:2150)
 <TASK>
 ? __warn (kernel/panic.c:242 kernel/panic.c:748)
 ? register_netdevice_notifier_dev_net (net/core/dev.c:2150)
 ? register_netdevice_notifier_dev_net (net/core/dev.c:2150)
 ? report_bug (lib/bug.c:? lib/bug.c:219)
 ? handle_bug (arch/x86/kernel/traps.c:285)
 ? exc_invalid_op (arch/x86/kernel/traps.c:309)
 ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
 ? register_netdevice_notifier_dev_net (net/core/dev.c:2150)
 ? register_netdevice_notifier_dev_net (./include/net/net_namespace.h:406 ./include/linux/netdevice.h:2663 net/core/dev.c:2144)
 mlx5e_mdev_notifier_event+0x9f/0xf0 mlx5_ib
 notifier_call_chain.llvm.12241336988804114627 (kernel/notifier.c:85)
 blocking_notifier_call_chain (kernel/notifier.c:380)
 mlx5_core_uplink_netdev_event_replay (drivers/net/ethernet/mellanox/mlx5/core/main.c:352)
 mlx5_ib_roce_init.llvm.12447516292400117075+0x1c6/0x550 mlx5_ib
 mlx5r_probe+0x375/0x6a0 mlx5_ib
 ? kernfs_put (./include/linux/instrumented.h:96 ./include/linux/atomic/atomic-arch-fallback.h:2278 ./include/linux/atomic/atomic-instrumented.h:1384 fs/kernfs/dir.c:557)
 ? auxiliary_match_id (drivers/base/auxiliary.c:174)
 ? mlx5r_mp_remove+0x160/0x160 mlx5_ib
 really_probe (drivers/base/dd.c:? drivers/base/dd.c:658)
 driver_probe_device (drivers/base/dd.c:830)
 __driver_attach (drivers/base/dd.c:1217)
 bus_for_each_dev (drivers/base/bus.c:369)
 ? driver_attach (drivers/base/dd.c:1157)
 bus_add_driver (drivers/base/bus.c:679)
 driver_register (drivers/base/driver.c:249)

Fixes: 7fb1073300 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().")
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/netdev/20250224-noisy-cordial-roadrunner-fad40c@leitao/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250225211023.96448-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 17:34:36 -08:00
Paolo Abeni
6bc7e4eb04 Revert "net: skb: introduce and use a single page frag cache"
After the previous commit is finally safe to revert commit dbae2b0628
("net: skb: introduce and use a single page frag cache"): do it here.

The intended goal of such change was to counter a performance regression
introduced by commit 3226b158e6 ("net: avoid 32 x truesize
under-estimation for tiny skbs").

Unfortunately, the blamed commit introduces another regression for the
virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny
size, so that the whole head frag could fit a 512-byte block.

The single page frag cache uses a 1K fragment for such allocation, and
the additional overhead, under small UDP packets flood, makes the page
allocator a bottleneck.

Thanks to commit bf9f1baa27 ("net: add dedicated kmem_cache for
typical/small skb->head"), this revert does not re-introduce the
original regression. Actually, in the relevant test on top of this
revert, I measure a small but noticeable positive delta, just above
noise level.

The revert itself required some additional mangling due to recent updates
in the affected code.

Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: dbae2b0628 ("net: skb: introduce and use a single page frag cache")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-20 10:53:25 +01:00
Breno Leitao
4b5a28b38c net: Add non-RCU dev_getbyhwaddr() helper
Add dedicated helper for finding devices by hardware address when
holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents
PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not.

Extract common address comparison logic into dev_addr_cmp().

The context about this change could be found in the following
discussion:

Link: https://lore.kernel.org/all/20250206-scarlet-ermine-of-improvement-1fcac5@leitao/

Cc: kuniyu@amazon.com
Cc: ushankar@purestorage.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-1-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-19 18:59:29 -08:00
Kuniyuki Iwashima
d4c6bfc839 dev: Use rtnl_net_dev_lock() in unregister_netdev().
The following sequence is basically illegal when dev was fetched
without lookup because dev_net(dev) might be different after holding
rtnl_net_lock():

  net = dev_net(dev);
  rtnl_net_lock(net);

Let's use rtnl_net_dev_lock() in unregister_netdev().

Note that there is no real bug in unregister_netdev() for now
because RTNL protects the scope even if dev_net(dev) is changed
before/after RTNL.

Fixes: 00fb982393 ("dev: Hold per-netns RTNL in (un)?register_netdev().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250217191129.19967-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:33:30 -08:00
Kuniyuki Iwashima
65161fb544 net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().
After the cited commit, dev_net(dev) is fetched before holding RTNL
and passed to __unregister_netdevice_notifier_net().

However, dev_net(dev) might be different after holding RTNL.

In the reported case [0], while removing a VF device, its netns was
being dismantled and the VF was moved to init_net.

So the following sequence is basically illegal when dev was fetched
without lookup:

  net = dev_net(dev);
  rtnl_net_lock(net);

Let's use a new helper rtnl_net_dev_lock() to fix the race.

It fetches dev_net_rcu(dev), bumps its net->passive, and checks if
dev_net_rcu(dev) is changed after rtnl_net_lock().

[0]:
BUG: KASAN: slab-use-after-free in notifier_call_chain (kernel/notifier.c:75 (discriminator 2))
Read of size 8 at addr ffff88810cefb4c8 by task test-bridge-lag/21127
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl (lib/dump_stack.c:123)
 print_report (mm/kasan/report.c:379 mm/kasan/report.c:489)
 kasan_report (mm/kasan/report.c:604)
 notifier_call_chain (kernel/notifier.c:75 (discriminator 2))
 call_netdevice_notifiers_info (net/core/dev.c:2011)
 unregister_netdevice_many_notify (net/core/dev.c:11551)
 unregister_netdevice_queue (net/core/dev.c:11487)
 unregister_netdev (net/core/dev.c:11635)
 mlx5e_remove (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6552 drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6579) mlx5_core
 auxiliary_bus_remove (drivers/base/auxiliary.c:230)
 device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296)
 bus_remove_device (./include/linux/kobject.h:193 drivers/base/base.h:73 drivers/base/bus.c:583)
 device_del (drivers/base/power/power.h:142 drivers/base/core.c:3855)
 mlx5_rescan_drivers_locked (./include/linux/auxiliary_bus.h:241 drivers/net/ethernet/mellanox/mlx5/core/dev.c:333 drivers/net/ethernet/mellanox/mlx5/core/dev.c:535 drivers/net/ethernet/mellanox/mlx5/core/dev.c:549) mlx5_core
 mlx5_unregister_device (drivers/net/ethernet/mellanox/mlx5/core/dev.c:468) mlx5_core
 mlx5_uninit_one (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 drivers/net/ethernet/mellanox/mlx5/core/main.c:1563) mlx5_core
 remove_one (drivers/net/ethernet/mellanox/mlx5/core/main.c:965 drivers/net/ethernet/mellanox/mlx5/core/main.c:2019) mlx5_core
 pci_device_remove (./include/linux/pm_runtime.h:129 drivers/pci/pci-driver.c:475)
 device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296)
 unbind_store (drivers/base/bus.c:245)
 kernfs_fop_write_iter (fs/kernfs/file.c:338)
 vfs_write (fs/read_write.c:587 (discriminator 1) fs/read_write.c:679 (discriminator 1))
 ksys_write (fs/read_write.c:732)
 do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f6a4d5018b7

Fixes: 7fb1073300 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().")
Reported-by: Yael Chemla <ychemla@nvidia.com>
Closes: https://lore.kernel.org/netdev/146eabfe-123c-4970-901e-e961b4c09bc3@nvidia.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250217191129.19967-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-18 18:33:30 -08:00
Paolo Abeni
d3ed6dee73 net: harmonize tstats and dstats
After the blamed commits below, some UDP tunnel use dstats for
accounting. On the xmit path, all the UDP-base tunnels ends up
using iptunnel_xmit_stats() for stats accounting, and the latter
assumes the relevant (tunnel) network device uses tstats.

The end result is some 'funny' stat report for the mentioned UDP
tunnel, e.g. when no packet is actually dropped and a bunch of
packets are transmitted:

gnv2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue \
		state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether ee:7d:09:87:90:ea brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
         14916      23      0      15       0       0
    TX:  bytes packets errors dropped carrier collsns
             0    1566      0       0       0       0

Address the issue ensuring the same binary layout for the overlapping
fields of dstats and tstats. While this solution is a bit hackish, is
smaller and with no performance pitfall compared to other alternatives
i.e. supporting both dstat and tstat in iptunnel_xmit_stats() or
reverting the blamed commit.

With time we should possibly move all the IP-based tunnel (and virtual
devices) to dstats.

Fixes: c77200c074 ("bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: 6fa6de3022 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: be226352e8 ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/2e1c444cf0f63ae472baff29862c4c869be17031.1738432804.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03 18:39:59 -08:00
Eric Dumazet
e759e1e4a4 net: revert RTNL changes in unregister_netdevice_many_notify()
This patch reverts following changes:

83419b61d1 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
ae646f1a0b net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
cfa579f666 net: no longer hold RTNL while calling flush_all_backlogs()

This caused issues in layers holding a private mutex:

cleanup_net()
  rtnl_lock();
	mutex_lock(subsystem_mutex);

	unregister_netdevice();

	   rtnl_unlock();		// LOCKDEP violation
	   rtnl_lock();

I will revisit this in next cycle, opt-in for the new behavior
from safe contexts only.

Fixes: cfa579f666 ("net: no longer hold RTNL while calling flush_all_backlogs()")
Fixes: ae646f1a0b ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)")
Fixes: 83419b61d1 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)")
Reported-by: syzbot+5b9196ecf74447172a9a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6789d55f.050a0220.20d369.004e.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250129142726.747726-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30 08:57:18 -08:00
Toke Høiland-Jørgensen
3595599fa8 net: xdp: Disallow attaching device-bound programs in generic mode
Device-bound programs are used to support RX metadata kfuncs. These
kfuncs are driver-specific and rely on the driver context to read the
metadata. This means they can't work in generic XDP mode. However, there
is no check to disallow such programs from being attached in generic
mode, in which case the metadata kfuncs will be called in an invalid
context, leading to crashes.

Fix this by adding a check to disallow attaching device-bound programs
in generic mode.

Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de
Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 19:04:23 -08:00
Jakub Kicinski
67e4bb2ced net: page_pool: don't try to stash the napi id
Page ppol tried to cache the NAPI ID in page pool info to avoid
having a dependency on the life cycle of the NAPI instance.
Since commit under Fixes the NAPI ID is not populated until
napi_enable() and there's a good chance that page pool is
created before NAPI gets enabled.

Protect the NAPI pointer with the existing page pool mutex,
the reading path already holds it. napi_id itself we need
to READ_ONCE(), it's protected by netdev_lock() which are
not holding in page pool.

Before this patch napi IDs were missing for mlx5:

 # ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get

 [{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912},
  {'id': 143, 'ifindex': 2, 'inflight': 5568, 'inflight-mem': 22806528},
  {'id': 142, 'ifindex': 2, 'inflight': 5120, 'inflight-mem': 20971520},
  {'id': 141, 'ifindex': 2, 'inflight': 4992, 'inflight-mem': 20447232},
  ...

After:

 [{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912,
   'napi-id': 565},
  {'id': 143, 'ifindex': 2, 'inflight': 4224, 'inflight-mem': 17301504,
   'napi-id': 525},
  {'id': 142, 'ifindex': 2, 'inflight': 4288, 'inflight-mem': 17563648,
   'napi-id': 524},
  ...

Fixes: 86e25f40aa ("net: napi: Add napi_config")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250123231620.1086401-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27 14:37:41 -08:00
Jakub Kicinski
32ad1f7a05 net: provide pending ring configuration in net_device
Record the pending configuration in net_device struct.
ethtool core duplicates the current config and the specific
handlers (for now just ringparam) can modify it.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 11:44:57 -08:00
Jakub Kicinski
3c836451ca net: move HDS config from ethtool state
Separate the HDS config from the ethtool state struct.
The HDS config contains just simple parameters, not state.
Having it as a separate struct will make it easier to clone / copy
and also long term potentially make it per-queue.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20 11:44:57 -08:00
Xin Long
a12c76a033 net: sched: refine software bypass handling in tc_run
This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.

Previously, filtercnt and skipswcnt were introduced by:

  Commit 2081fd3445 ("net: sched: cls_api: add filter counter") and
  Commit f631ef39d8 ("net: sched: cls_api: add skip_sw counter")

  filtercnt tracked all tp (tcf_proto) objects added to a block, and
  skipswcnt counted tp objects with the skipsw attribute set.

The problem is: a single tp can contain multiple filters, some with skipsw
and others without. The current implementation fails in the case:

  When the first filter in a tp has skipsw, both skipswcnt and filtercnt
  are incremented, then adding a second filter without skipsw to the same
  tp does not modify these counters because tp->counted is already set.

  This results in bypass software behavior based solely on skipswcnt
  equaling filtercnt, even when the block includes filters without
  skipsw. Consequently, filters without skipsw are inadvertently bypassed.

To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:

  Whenever a filter without skipsw is added, its tp is marked with usesw
  and counted in useswcnt. tc_run() now uses useswcnt to determine software
  bypass, eliminating reliance on filtercnt and skipswcnt.

  This refined approach prevents software bypass for blocks containing
  mixed filters, ensuring correct behavior in tc_run().

Additionally, as atomic operations on useswcnt ensure thread safety and
tp->lock guards access to tp->usesw and tp->counted, the broader lock
down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
and this resolves a performance regression caused by the filter counting
mechanism during parallel filter insertions.

  The improvement can be demonstrated using the following script:

  # cat insert_tc_rules.sh

    tc qdisc add dev ens1f0np0 ingress
    for i in $(seq 16); do
        taskset -c $i tc -b rules_$i.txt &
    done
    wait

  Each of rules_$i.txt files above includes 100000 tc filter rules to a
  mlx5 driver NIC ens1f0np0.

  Without this patch:

  # time sh insert_tc_rules.sh

    real    0m50.780s
    user    0m23.556s
    sys	    4m13.032s

  With this patch:

  # time sh insert_tc_rules.sh

    real    0m17.718s
    user    0m7.807s
    sys     3m45.050s

Fixes: 047f340b36 ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-20 09:21:27 +00:00
Eric Dumazet
185e5b8690 net: introduce netdev_napi_exit()
After 1b23cdbd2b ("net: protect netdev->napi_list with netdev_lock()")
it makes sense to iterate through dev->napi_list while holding
the device lock.

Also call synchronize_net() at most one time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250117232113.1612899-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18 17:41:21 -08:00
Eric Dumazet
bff406bc04 net: destroy dev->lock later in free_netdev()
syzbot complained that free_netdev() was calling netif_napi_del()
after dev->lock mutex has been destroyed.

This fires a warning for CONFIG_DEBUG_MUTEXES=y builds.

Move mutex_destroy(&dev->lock) near the end of free_netdev().

[1]
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock_common kernel/locking/mutex.c:564 [inline]
 WARNING: CPU: 0 PID: 5971 at kernel/locking/mutex.c:564 __mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735
Modules linked in:
CPU: 0 UID: 0 PID: 5971 Comm: syz-executor Not tainted 6.13.0-rc7-syzkaller-01131-g8d20dcda404d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024
 RIP: 0010:__mutex_lock_common kernel/locking/mutex.c:564 [inline]
 RIP: 0010:__mutex_lock+0xdac/0xee0 kernel/locking/mutex.c:735
Code: 0f b6 04 38 84 c0 0f 85 1a 01 00 00 83 3d 6f 40 4c 04 00 75 19 90 48 c7 c7 60 84 0a 8c 48 c7 c6 00 85 0a 8c e8 f5 dc 91 f5 90 <0f> 0b 90 90 90 e9 c7 f3 ff ff 90 0f 0b 90 e9 29 f8 ff ff 90 0f 0b
RSP: 0018:ffffc90003317580 EFLAGS: 00010246
RAX: ee0f97edaf7b7d00 RBX: ffff8880299f8cb0 RCX: ffff8880323c9e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90003317710 R08: ffffffff81602ac2 R09: 1ffff110170c519a
R10: dffffc0000000000 R11: ffffed10170c519b R12: 0000000000000000
R13: 0000000000000000 R14: 1ffff92000662ec4 R15: dffffc0000000000
FS:  000055557a046500(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd581d46ff8 CR3: 000000006f870000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  netdev_lock include/linux/netdevice.h:2691 [inline]
  __netif_napi_del include/linux/netdevice.h:2829 [inline]
  netif_napi_del include/linux/netdevice.h:2848 [inline]
  free_netdev+0x2d9/0x610 net/core/dev.c:11621
  netdev_run_todo+0xf21/0x10d0 net/core/dev.c:11189
  nsim_destroy+0x3c3/0x620 drivers/net/netdevsim/netdev.c:1028
  __nsim_dev_port_del+0x14b/0x1b0 drivers/net/netdevsim/dev.c:1428
  nsim_dev_port_del_all drivers/net/netdevsim/dev.c:1440 [inline]
  nsim_dev_reload_destroy+0x28a/0x490 drivers/net/netdevsim/dev.c:1661
  nsim_drv_remove+0x58/0x160 drivers/net/netdevsim/dev.c:1676
  device_remove drivers/base/dd.c:567 [inline]

Fixes: 1b23cdbd2b ("net: protect netdev->napi_list with netdev_lock()")
Reported-by: syzbot+85ff1051228a04613a32@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/678add43.050a0220.303755.0016.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250117224626.1427577-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18 17:36:48 -08:00
Kuniyuki Iwashima
be94cfdb99 dev: Hold rtnl_net_lock() for dev_ifsioc().
Basically, dev_ifsioc() operates on the passed single netns (except
for netdev notifier chains with lower/upper devices for which we will
need more changes).

Let's hold rtnl_net_lock() for dev_ifsioc().

Now that NETDEV_CHANGENAME is always triggered under rtnl_net_lock()
of the device's netns. (do_setlink() and dev_ifsioc())

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 17:20:50 -08:00
Kuniyuki Iwashima
2f1bb1e2cc dev: Remove devnet_rename_sem.
devnet_rename_sem is no longer used since commit
0840556e5a ("net: Protect dev->name by seqlock.").

Also, RTNL serialises dev_change_name().

Let's remove devnet_rename_sem.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 17:20:50 -08:00
Kuniyuki Iwashima
e361560a79 dev: Acquire netdev_rename_lock before restoring dev->name in dev_change_name().
The cited commit forgot to add netdev_rename_lock in one of the
error paths in dev_change_name().

Let's hold netdev_rename_lock before restoring the old dev->name.

Fixes: 0840556e5a ("net: Protect dev->name by seqlock.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 17:20:50 -08:00
Jakub Kicinski
707ec6279d Merge branch 'net-reduce-rtnl-pressure-in-unregister_netdevice'
Eric Dumazet says:

====================
net: reduce RTNL pressure in unregister_netdevice()

One major source of RTNL contention resides in unregister_netdevice()

Due to RCU protection of various network structures, and
unregister_netdevice() being a synchronous function,
it is calling potentially slow functions while holding RTNL.

I think we can release RTNL in two points, so that three
slow functions are called while RTNL can be used
by other threads.

v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6
====================

Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:17:08 -08:00
Eric Dumazet
83419b61d1 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
One synchronize_net() call is currently done while holding RTNL.

This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads after unlist_netdevice() call
and setting dev->reg_state to NETREG_UNREGISTERING.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:17:05 -08:00
Eric Dumazet
ae646f1a0b net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
Two synchronize_net() calls are currently done while holding RTNL.

This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.

For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().

This should be safe, because devices are no longer visible
to other threads at this point.

In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:17:04 -08:00
Eric Dumazet
cfa579f666 net: no longer hold RTNL while calling flush_all_backlogs()
flush_all_backlogs() is called from unregister_netdevice_many_notify()
as part of netdevice dismantles.

This is currently called under RTNL, and can last up to 50 ms
on busy hosts.

There is no reason to hold RTNL at this stage, if our caller
is cleanup_net() : netns are no more visible, devices
are in NETREG_UNREGISTERING state and no other thread
could mess our state while RTNL is temporarily released.

In order to provide isolation, this patch provides a separate
'net_todo_list' for cleanup_net().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:17:04 -08:00
Eric Dumazet
8a2b61e9e8 net: no longer assume RTNL is held in flush_all_backlogs()
flush_all_backlogs() uses per-cpu and static data to hold its
temporary data, on the assumption it is called under RTNL
protection.

Following patch in the series will break this assumption.

Use instead a dynamically allocated piece of memory.

In the unlikely case the allocation fails,
use a boot-time allocated memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:17:04 -08:00
Eric Dumazet
0734d7c3d9 net: expedite synchronize_net() for cleanup_net()
cleanup_net() is the single thread responsible
for netns dismantles, and a serious bottleneck.

Before we can get per-netns RTNL, make sure
all synchronize_net() called from this thread
are using rcu_synchronize_expedited().

v3: deal with CONFIG_NET_NS=n

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:17:03 -08:00
Jakub Kicinski
53ed30800d net: protect napi->irq with netdev_lock()
Take netdev_lock() in netif_napi_set_irq(). All NAPI "control fields"
are now protected by that lock (most of the other ones are set during
napi add/del). The napi_hash_node is fully protected by the hash
spin lock, but close enough for the kdoc...

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
1bb86cf8f4 net: protect threaded status of NAPI with netdev_lock()
Now that NAPI instances can't come and go without holding
netdev->lock we can trivially switch from rtnl_lock() to
netdev_lock() for setting netdev->threaded via sysfs.

Note that since we do not lock netdev_lock around sysfs
calls in the core we don't have to "trylock" like we do
with rtnl_lock.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
eeeec1d4c6 net: make netdev netlink ops hold netdev_lock()
In prep for dropping rtnl_lock, start locking netdev->lock in netlink
genl ops. We need to be using netdev->up instead of flags & IFF_UP.

We can remove the RCU lock protection for the NAPI since NAPI list
is protected by netdev->lock already.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
413f0271f3 net: protect NAPI enablement with netdev_lock()
Wrap napi_enable() / napi_disable() with netdev_lock().
Provide the "already locked" flavor of the API.

iavf needs the usual adjustment. A number of drivers call
napi_enable() under a spin lock, so they have to be modified
to take netdev_lock() first, then spin lock then call
napi_enable_locked().

Protecting napi_enable() implies that napi->napi_id is protected
by netdev_lock().

Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
1b23cdbd2b net: protect netdev->napi_list with netdev_lock()
Hold netdev->lock when NAPIs are getting added or removed.
This will allow safe access to NAPI instances of a net_device
without rtnl_lock.

Create a family of helpers which assume the lock is already taken.
Switch iavf to them, as it makes extensive use of netdev->lock,
already.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
5112457f3d net: add netdev->up protected by netdev_lock()
Some uAPI (netdev netlink) hide net_device's sub-objects while
the interface is down to ensure uniform behavior across drivers.
To remove the rtnl_lock dependency from those uAPIs we need a way
to safely tell if the device is down or up.

Add an indication of whether device is open or closed, protected
by netdev->lock. The semantics are the same as IFF_UP, but taking
netdev_lock around every write to ->flags would be a lot of code
churn.

We don't want to blanket the entire open / close path by netdev_lock,
because it will prevent us from applying it to specific structures -
core helpers won't be able to take that lock from any function
called by the drivers on open/close paths.

So the state of the flag is "pessimistic", as in it may report false
negatives, but never false positives.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski
2628f4958c net: add helpers for lookup and walking netdevs under netdev_lock()
Add helpers for accessing netdevs under netdev_lock().
There's some careful handling needed to find the device and lock it
safely, without it getting unregistered, and without taking rtnl_lock
(the latter being the whole point of the new locking, after all).

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:33 -08:00
Jakub Kicinski
5fda3f3534 net: make netdev_lock() protect netdev->reg_state
Protect writes to netdev->reg_state with netdev_lock().
From now on holding netdev_lock() is sufficient to prevent
the net_device from getting unregistered, so code which
wants to hold just a single netdev around no longer needs
to hold rtnl_lock.

We do not protect the NETREG_UNREGISTERED -> NETREG_RELEASED
transition. We'd need to move mutex_destroy(netdev->lock)
to .release, but the real reason is that trying to stop
the unregistration process mid-way would be unsafe / crazy.
Taking references on such devices is not safe, either.
So the intended semantics are to lock REGISTERED devices.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:33 -08:00
Taehee Yoo
2d46e481a9 net: disallow setup single buffer XDP when tcp-data-split is enabled.
When a single buffer XDP is attached, NIC should guarantee only single
page packets will be received.
tcp-data-split feature splits packets into header and payload. single
buffer XDP can't handle it properly.
So attaching single buffer XDP should be disallowed when tcp-data-split
is enabled.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-6-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 14:42:11 -08:00
Taehee Yoo
197258f0ef net: ethtool: add hds_config member in ethtool_netdev_state
When tcp-data-split is UNKNOWN mode, drivers arbitrarily handle it.
For example, bnxt_en driver automatically enables if at least one of
LRO/GRO/JUMBO is enabled.
If tcp-data-split is UNKNOWN and LRO is enabled, a driver returns
ENABLES of tcp-data-split, not UNKNOWN.
So, `ethtool -g eth0` shows tcp-data-split is enabled.

The problem is in the setting situation.
In the ethnl_set_rings(), it first calls get_ringparam() to get the
current driver's config.
At that moment, if driver's tcp-data-split config is UNKNOWN, it returns
ENABLE if LRO/GRO/JUMBO is enabled.
Then, it sets values from the user and driver's current config to
kernel_ethtool_ringparam.
Last it calls .set_ringparam().
The driver, especially bnxt_en driver receives
ETHTOOL_TCP_DATA_SPLIT_ENABLED.
But it can't distinguish whether it is set by the user or just the
current config.

When user updates ring parameter, the new hds_config value is updated
and current hds_config value is stored to old_hdsconfig.
Driver's .set_ringparam() callback can distinguish a passed
tcp-data-split value is came from user explicitly.
If .set_ringparam() is failed, hds_config is rollbacked immediately.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-2-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 14:42:11 -08:00
Jakub Kicinski
37adf101f6 net: cleanup init_dummy_netdev_core()
init_dummy_netdev_core() used to cater to net_devices which
did not come from alloc_netdev_mqs(). Since that's no longer
supported remove the init logic which duplicates alloc_netdev_mqs().

While at it rename back to init_dummy_netdev().

Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-13 19:06:51 -08:00
Jakub Kicinski
f835bdae71 net: remove init_dummy_netdev()
init_dummy_netdev() can initialize statically declared or embedded
net_devices. Such netdevs did not come from alloc_netdev_mqs().
After recent work by Breno, there are the only two cases where
we have do that.

Switch those cases to alloc_netdev_mqs() and delete init_dummy_netdev().
Dealing with static netdevs is not worth the maintenance burden.

Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-13 19:06:51 -08:00
Jakub Kicinski
21520e74ba net: hide the definition of dev_get_by_napi_id()
There are no module callers of dev_get_by_napi_id(),
and commit d1cacd7477 ("netdev: prevent accessing NAPI instances
from another namespace") proves that getting NAPI by id
needs to be done with care. So hide dev_get_by_napi_id().

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250110004924.3212260-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-10 18:37:01 -08:00
Jakub Kicinski
14ea4cd1b1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc7).

Conflicts:
  a42d71e322 ("net_sched: sch_cake: Add drop reasons")
  737d4d91d3 ("sched: sch_cake: add bounds checks to host bulk flow fairness counts")

Adjacent changes:

drivers/net/ethernet/meta/fbnic/fbnic.h
  3a856ab347 ("eth: fbnic: add IRQ reuse support")
  95978931d5 ("eth: fbnic: Revert "eth: fbnic: Add hardware monitoring support via HWMON interface"")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-09 16:11:47 -08:00
Jakub Kicinski
d6c7b03497 net: make sure we retain NAPI ordering on netdev->napi_list
Netlink code depends on NAPI instances being sorted by ID on
the netdev list for dump continuation. We need to be able to
find the position on the list where we left off if dump does
not fit in a single skb, and in the meantime NAPI instances
can come and go.

This was trivially true when we were assigning a new ID to every
new NAPI instance. Since we added the NAPI config API, we try
to retain the ID previously used for the same queue, but still
add the new NAPI instance at the start of the list.

This is fine if we reset the entire netdev and all NAPIs get
removed and added back. If driver replaces a NAPI instance
during an operation like DEVMEM queue reset, or recreates
a subset of NAPI instances in other ways we may end up with
broken ordering, and therefore Netlink dumps with either
missing or duplicated entries.

At this stage the problem is theoretical. Only two drivers
support queue API, bnxt and gve. gve recreates NAPIs during
queue reset, but it doesn't support NAPI config.
bnxt supports NAPI config but doesn't recreate instances
during reset.

We need to save the ID in the config as soon as it is assigned
because otherwise the new NAPI will not know what ID it will
get at enable time, at the time it is being added.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-09 15:33:08 +01:00
Eric Dumazet
2170a1f091 net: no longer reset transport_header in __netif_receive_skb_core()
In commit 66e4c8d950 ("net: warn if transport header was not set")
I added a debug check in skb_transport_header() to detect
if a caller expects the transport_header to be set to a meaningful
value by a prior code path.

Unfortunately, __netif_receive_skb_core() resets the transport header
to the same value than the network header, defeating this check
in receive paths.

Pretending the transport and network headers are the same
is usually wrong.

This patch removes this reset for CONFIG_DEBUG_NET=y builds
to let fuzzers and CI find bugs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144342.499759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-09 13:18:41 +01:00
Jakub Kicinski
d1cacd7477 netdev: prevent accessing NAPI instances from another namespace
The NAPI IDs were not fully exposed to user space prior to the netlink
API, so they were never namespaced. The netlink API must ensure that
at the very least NAPI instance belongs to the same netns as the owner
of the genl sock.

napi_by_id() can become static now, but it needs to move because of
dev_get_by_napi_id().

Cc: stable@vger.kernel.org
Fixes: 1287c1ae0f ("netdev-genl: Support setting per-NAPI config values")
Fixes: 27f91aaf49 ("netdev-genl: Add netlink framework functions for napi")
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250106180137.1861472-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-08 10:21:00 -08:00
Kuniyuki Iwashima
7fb1073300 net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().
(un)?register_netdevice_notifier_dev_net() hold RTNL before triggering
the notifier for all netdev in the netns.

Let's convert the RTNL to rtnl_net_lock().

Note that move_netdevice_notifiers_dev_net() is assumed to be (but not
yet) protected by per-netns RTNL of both src and dst netns; we need to
convert wireless and hyperv drivers that call dev_change_net_namespace().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 17:49:20 -08:00
Kuniyuki Iwashima
ca779f4065 net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_net().
(un)?register_netdevice_notifier_net() hold RTNL before triggering the
notifier for all netdev in the netns.

Let's convert the RTNL to rtnl_net_lock().

Note that the per-netns netdev notifier is protected by per-netns RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 17:49:20 -08:00
Kuniyuki Iwashima
a239e06250 net: Hold __rtnl_net_lock() in (un)?register_netdevice_notifier().
(un)?register_netdevice_notifier() hold pernet_ops_rwsem and RTNL,
iterate all netns, and trigger the notifier for all netdev.

Let's hold __rtnl_net_lock() before triggering the notifier.

Note that we will need protection for netdev_chain when RTNL is
removed.  (e.g. blocking_notifier conversion [0] with a lockdep
annotation [1])

Link: https://lore.kernel.org/netdev/20250104063735.36945-2-kuniyu@amazon.com/ [0]
Link: https://lore.kernel.org/netdev/20250105075957.67334-1-kuniyu@amazon.com/ [1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 17:49:19 -08:00
Eric Dumazet
1b960cd193 net: watchdog: rename __dev_watchdog_up() and dev_watchdog_down()
In commit d7811e623d ("[NET]: Drop tx lock in dev_watchdog_up")
dev_watchdog_up() became a simple wrapper for __netdev_watchdog_up()

Herbert also said : "In 2.6.19 we can eliminate the unnecessary
__dev_watchdog_up and replace it with dev_watchdog_up."

This patch consolidates things to have only two functions, with
a common prefix.

- netdev_watchdog_up(), exported for the sake of one freescale driver.
  This replaces __netdev_watchdog_up() and dev_watchdog_up().

- netdev_watchdog_down(), static to net/sched/sch_generic.c
  This replaces dev_watchdog_down().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250105090924.1661822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 17:43:01 -08:00
Kuniyuki Iwashima
00fb982393 dev: Hold per-netns RTNL in (un)?register_netdev().
Let's hold per-netns RTNL of dev_net(dev) in register_netdev()
and unregister_netdev().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-07 13:45:53 +01:00
Jakub Kicinski
385f186aba Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc6).

No conflicts.

Adjacent changes:

include/linux/if_vlan.h
  f91a5b8089 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK")
  3f330db306 ("net: reformat kdoc return statements")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03 16:29:29 -08:00
Willem de Bruijn
68e068cabd net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets
The blamed commit disabled hardware offoad of IPv6 packets with
extension headers on devices that advertise NETIF_F_IPV6_CSUM,
based on the definition of that feature in skbuff.h:

 *   * - %NETIF_F_IPV6_CSUM
 *     - Driver (device) is only able to checksum plain
 *       TCP or UDP packets over IPv6. These are specifically
 *       unencapsulated packets of the form IPv6|TCP or
 *       IPv6|UDP where the Next Header field in the IPv6
 *       header is either TCP or UDP. IPv6 extension headers
 *       are not supported with this feature. This feature
 *       cannot be set in features for a device with
 *       NETIF_F_HW_CSUM also set. This feature is being
 *       DEPRECATED (see below).

The change causes skb_warn_bad_offload to fire for BIG TCP
packets.

[  496.310233] WARNING: CPU: 13 PID: 23472 at net/core/dev.c:3129 skb_warn_bad_offload+0xc4/0xe0

[  496.310297]  ? skb_warn_bad_offload+0xc4/0xe0
[  496.310300]  skb_checksum_help+0x129/0x1f0
[  496.310303]  skb_csum_hwoffload_help+0x150/0x1b0
[  496.310306]  validate_xmit_skb+0x159/0x270
[  496.310309]  validate_xmit_skb_list+0x41/0x70
[  496.310312]  sch_direct_xmit+0x5c/0x250
[  496.310317]  __qdisc_run+0x388/0x620

BIG TCP introduced an IPV6_TLV_JUMBO IPv6 extension header to
communicate packet length, as this is an IPv6 jumbogram. But, the
feature is only enabled on devices that support BIG TCP TSO. The
header is only present for PF_PACKET taps like tcpdump, and not
transmitted by physical devices.

For this specific case of extension headers that are not
transmitted, return to the situation before the blamed commit
and support hardware offload.

ipv6_has_hopopt_jumbo() tests not only whether this header is present,
but also that it is the only extension header before a terminal (L4)
header.

Fixes: 04c20a9356 ("net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iK1hdC3Nt8KPhOtTF8vCPc1AHDCtse_BTNki1pWxAByTQ@mail.gmail.com/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250101164909.1331680-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-02 17:13:44 -08:00
Toke Høiland-Jørgensen
e77d9aee95 xdp: register system page pool as an XDP memory model
To make the system page pool usable as a source for allocating XDP
frames, we need to register it with xdp_reg_mem_model(), so that page
return works correctly. This is done in preparation for using the system
page_pool to convert XDP_PASS XSk frames to skbs; for the same reason,
make the per-cpu variable non-static so we can access it from other
source files as well (but w/o exporting).

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05 18:41:07 -08:00
Alexander Lobakin
7cd1107f48 bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other
stuff that doesn't modify the structure itself. Same for net_device.
Address at least some of them and add `const` attributes there. The
object code didn't change, but that may prevent unwanted data
modifications and also allow more helpers to have const arguments.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05 18:41:06 -08:00