napi_schedule() is expected to be called either:
* From an interrupt, where raised softirqs are handled on IRQ exit
* From a softirq disabled section, where raised softirqs are handled on
the next call to local_bh_enable().
* From a softirq handler, where raised softirqs are handled on the next
round in do_softirq(), or further deferred to a dedicated kthread.
Other bare tasks context may end up ignoring the raised NET_RX vector
until the next random softirq handling opportunity, which may not
happen before a while if the CPU goes idle afterwards with the tick
stopped.
Such "misuses" have been detected on several places thanks to messages
of the kind:
"NOHZ tick-stop error: local softirq work is pending, handler #08!!!"
For example:
__raise_softirq_irqoff
__napi_schedule
rtl8152_runtime_resume.isra.0
rtl8152_resume
usb_resume_interface.isra.0
usb_resume_both
__rpm_callback
rpm_callback
rpm_resume
__pm_runtime_resume
usb_autoresume_device
usb_remote_wakeup
hub_event
process_one_work
worker_thread
kthread
ret_from_fork
ret_from_fork_asm
And also:
* drivers/net/usb/r8152.c::rtl_work_func_t
* drivers/net/netdevsim/netdev.c::nsim_start_xmit
There is a long history of issues of this kind:
019edd01d1 ("ath10k: sdio: Add missing BH locking around napi_schdule()")
3300685893 ("idpf: disable local BH when scheduling napi for marker packets")
e3d5d70cb4 ("net: lan78xx: fix "softirq work is pending" error")
e55c27ed9c ("mt76: mt7615: add missing bh-disable around rx napi schedule")
c0182aa985 ("mt76: mt7915: add missing bh-disable around tx napi enable/schedule")
970be1dff2 ("mt76: disable BH around napi_schedule() calls")
019edd01d1 ("ath10k: sdio: Add missing BH locking around napi_schdule()")
30bfec4fec ("can: rx-offload: can_rx_offload_threaded_irq_finish(): add new function to be called from threaded interrupt")
e63052a5dd ("mlx5e: add add missing BH locking around napi_schdule()")
83a0c6e589 ("i40e: Invoke softirqs after napi_reschedule")
bd4ce941c8 ("mlx4: Invoke softirqs after napi_reschedule")
8cf699ec84 ("mlx4: do not call napi_schedule() without care")
ec13ee8014 ("virtio_net: invoke softirqs after __napi_schedule")
This shows that relying on the caller to arrange a proper context for
the softirqs to be handled while calling napi_schedule() is very fragile
and error prone. Also fixing them can also prove challenging if the
caller may be called from different kinds of contexts.
Therefore fix this from napi_schedule() itself with waking up ksoftirqd
when softirqs are raised from task contexts.
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Francois Romieu <romieu@fr.zoreil.com>
Closes: https://lore.kernel.org/lkml/354a2690-9bbf-4ccb-8769-fa94707a9340@molgen.mpg.de/
Cc: Breno Leitao <leitao@debian.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250223221708.27130-1-frederic@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the previous commit is finally safe to revert commit dbae2b0628
("net: skb: introduce and use a single page frag cache"): do it here.
The intended goal of such change was to counter a performance regression
introduced by commit 3226b158e6 ("net: avoid 32 x truesize
under-estimation for tiny skbs").
Unfortunately, the blamed commit introduces another regression for the
virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny
size, so that the whole head frag could fit a 512-byte block.
The single page frag cache uses a 1K fragment for such allocation, and
the additional overhead, under small UDP packets flood, makes the page
allocator a bottleneck.
Thanks to commit bf9f1baa27 ("net: add dedicated kmem_cache for
typical/small skb->head"), this revert does not re-introduce the
original regression. Actually, in the relevant test on top of this
revert, I measure a small but noticeable positive delta, just above
noise level.
The revert itself required some additional mangling due to recent updates
in the affected code.
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: dbae2b0628 ("net: skb: introduce and use a single page frag cache")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The following sequence is basically illegal when dev was fetched
without lookup because dev_net(dev) might be different after holding
rtnl_net_lock():
net = dev_net(dev);
rtnl_net_lock(net);
Let's use rtnl_net_dev_lock() in unregister_netdev().
Note that there is no real bug in unregister_netdev() for now
because RTNL protects the scope even if dev_net(dev) is changed
before/after RTNL.
Fixes: 00fb982393 ("dev: Hold per-netns RTNL in (un)?register_netdev().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250217191129.19967-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the cited commit, dev_net(dev) is fetched before holding RTNL
and passed to __unregister_netdevice_notifier_net().
However, dev_net(dev) might be different after holding RTNL.
In the reported case [0], while removing a VF device, its netns was
being dismantled and the VF was moved to init_net.
So the following sequence is basically illegal when dev was fetched
without lookup:
net = dev_net(dev);
rtnl_net_lock(net);
Let's use a new helper rtnl_net_dev_lock() to fix the race.
It fetches dev_net_rcu(dev), bumps its net->passive, and checks if
dev_net_rcu(dev) is changed after rtnl_net_lock().
[0]:
BUG: KASAN: slab-use-after-free in notifier_call_chain (kernel/notifier.c:75 (discriminator 2))
Read of size 8 at addr ffff88810cefb4c8 by task test-bridge-lag/21127
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:489)
kasan_report (mm/kasan/report.c:604)
notifier_call_chain (kernel/notifier.c:75 (discriminator 2))
call_netdevice_notifiers_info (net/core/dev.c:2011)
unregister_netdevice_many_notify (net/core/dev.c:11551)
unregister_netdevice_queue (net/core/dev.c:11487)
unregister_netdev (net/core/dev.c:11635)
mlx5e_remove (drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6552 drivers/net/ethernet/mellanox/mlx5/core/en_main.c:6579) mlx5_core
auxiliary_bus_remove (drivers/base/auxiliary.c:230)
device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296)
bus_remove_device (./include/linux/kobject.h:193 drivers/base/base.h:73 drivers/base/bus.c:583)
device_del (drivers/base/power/power.h:142 drivers/base/core.c:3855)
mlx5_rescan_drivers_locked (./include/linux/auxiliary_bus.h:241 drivers/net/ethernet/mellanox/mlx5/core/dev.c:333 drivers/net/ethernet/mellanox/mlx5/core/dev.c:535 drivers/net/ethernet/mellanox/mlx5/core/dev.c:549) mlx5_core
mlx5_unregister_device (drivers/net/ethernet/mellanox/mlx5/core/dev.c:468) mlx5_core
mlx5_uninit_one (./include/linux/instrumented.h:68 ./include/asm-generic/bitops/instrumented-non-atomic.h:141 drivers/net/ethernet/mellanox/mlx5/core/main.c:1563) mlx5_core
remove_one (drivers/net/ethernet/mellanox/mlx5/core/main.c:965 drivers/net/ethernet/mellanox/mlx5/core/main.c:2019) mlx5_core
pci_device_remove (./include/linux/pm_runtime.h:129 drivers/pci/pci-driver.c:475)
device_release_driver_internal (drivers/base/dd.c:1275 drivers/base/dd.c:1296)
unbind_store (drivers/base/bus.c:245)
kernfs_fop_write_iter (fs/kernfs/file.c:338)
vfs_write (fs/read_write.c:587 (discriminator 1) fs/read_write.c:679 (discriminator 1))
ksys_write (fs/read_write.c:732)
do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f6a4d5018b7
Fixes: 7fb1073300 ("net: Hold rtnl_net_lock() in (un)?register_netdevice_notifier_dev_net().")
Reported-by: Yael Chemla <ychemla@nvidia.com>
Closes: https://lore.kernel.org/netdev/146eabfe-123c-4970-901e-e961b4c09bc3@nvidia.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250217191129.19967-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After the blamed commits below, some UDP tunnel use dstats for
accounting. On the xmit path, all the UDP-base tunnels ends up
using iptunnel_xmit_stats() for stats accounting, and the latter
assumes the relevant (tunnel) network device uses tstats.
The end result is some 'funny' stat report for the mentioned UDP
tunnel, e.g. when no packet is actually dropped and a bunch of
packets are transmitted:
gnv2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue \
state UNKNOWN mode DEFAULT group default qlen 1000
link/ether ee:7d:09:87:90:ea brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
14916 23 0 15 0 0
TX: bytes packets errors dropped carrier collsns
0 1566 0 0 0 0
Address the issue ensuring the same binary layout for the overlapping
fields of dstats and tstats. While this solution is a bit hackish, is
smaller and with no performance pitfall compared to other alternatives
i.e. supporting both dstat and tstat in iptunnel_xmit_stats() or
reverting the blamed commit.
With time we should possibly move all the IP-based tunnel (and virtual
devices) to dstats.
Fixes: c77200c074 ("bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: 6fa6de3022 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: be226352e8 ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/2e1c444cf0f63ae472baff29862c4c869be17031.1738432804.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch reverts following changes:
83419b61d1 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
ae646f1a0b net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
cfa579f666 net: no longer hold RTNL while calling flush_all_backlogs()
This caused issues in layers holding a private mutex:
cleanup_net()
rtnl_lock();
mutex_lock(subsystem_mutex);
unregister_netdevice();
rtnl_unlock(); // LOCKDEP violation
rtnl_lock();
I will revisit this in next cycle, opt-in for the new behavior
from safe contexts only.
Fixes: cfa579f666 ("net: no longer hold RTNL while calling flush_all_backlogs()")
Fixes: ae646f1a0b ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)")
Fixes: 83419b61d1 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)")
Reported-by: syzbot+5b9196ecf74447172a9a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6789d55f.050a0220.20d369.004e.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250129142726.747726-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Device-bound programs are used to support RX metadata kfuncs. These
kfuncs are driver-specific and rely on the driver context to read the
metadata. This means they can't work in generic XDP mode. However, there
is no check to disallow such programs from being attached in generic
mode, in which case the metadata kfuncs will be called in an invalid
context, leading to crashes.
Fix this by adding a check to disallow attaching device-bound programs
in generic mode.
Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de
Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Page ppol tried to cache the NAPI ID in page pool info to avoid
having a dependency on the life cycle of the NAPI instance.
Since commit under Fixes the NAPI ID is not populated until
napi_enable() and there's a good chance that page pool is
created before NAPI gets enabled.
Protect the NAPI pointer with the existing page pool mutex,
the reading path already holds it. napi_id itself we need
to READ_ONCE(), it's protected by netdev_lock() which are
not holding in page pool.
Before this patch napi IDs were missing for mlx5:
# ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get
[{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912},
{'id': 143, 'ifindex': 2, 'inflight': 5568, 'inflight-mem': 22806528},
{'id': 142, 'ifindex': 2, 'inflight': 5120, 'inflight-mem': 20971520},
{'id': 141, 'ifindex': 2, 'inflight': 4992, 'inflight-mem': 20447232},
...
After:
[{'id': 144, 'ifindex': 2, 'inflight': 3072, 'inflight-mem': 12582912,
'napi-id': 565},
{'id': 143, 'ifindex': 2, 'inflight': 4224, 'inflight-mem': 17301504,
'napi-id': 525},
{'id': 142, 'ifindex': 2, 'inflight': 4288, 'inflight-mem': 17563648,
'napi-id': 524},
...
Fixes: 86e25f40aa ("net: napi: Add napi_config")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250123231620.1086401-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Record the pending configuration in net_device struct.
ethtool core duplicates the current config and the specific
handlers (for now just ringparam) can modify it.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Separate the HDS config from the ethtool state struct.
The HDS config contains just simple parameters, not state.
Having it as a separate struct will make it easier to clone / copy
and also long term potentially make it per-queue.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.
Previously, filtercnt and skipswcnt were introduced by:
Commit 2081fd3445 ("net: sched: cls_api: add filter counter") and
Commit f631ef39d8 ("net: sched: cls_api: add skip_sw counter")
filtercnt tracked all tp (tcf_proto) objects added to a block, and
skipswcnt counted tp objects with the skipsw attribute set.
The problem is: a single tp can contain multiple filters, some with skipsw
and others without. The current implementation fails in the case:
When the first filter in a tp has skipsw, both skipswcnt and filtercnt
are incremented, then adding a second filter without skipsw to the same
tp does not modify these counters because tp->counted is already set.
This results in bypass software behavior based solely on skipswcnt
equaling filtercnt, even when the block includes filters without
skipsw. Consequently, filters without skipsw are inadvertently bypassed.
To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:
Whenever a filter without skipsw is added, its tp is marked with usesw
and counted in useswcnt. tc_run() now uses useswcnt to determine software
bypass, eliminating reliance on filtercnt and skipswcnt.
This refined approach prevents software bypass for blocks containing
mixed filters, ensuring correct behavior in tc_run().
Additionally, as atomic operations on useswcnt ensure thread safety and
tp->lock guards access to tp->usesw and tp->counted, the broader lock
down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
and this resolves a performance regression caused by the filter counting
mechanism during parallel filter insertions.
The improvement can be demonstrated using the following script:
# cat insert_tc_rules.sh
tc qdisc add dev ens1f0np0 ingress
for i in $(seq 16); do
taskset -c $i tc -b rules_$i.txt &
done
wait
Each of rules_$i.txt files above includes 100000 tc filter rules to a
mlx5 driver NIC ens1f0np0.
Without this patch:
# time sh insert_tc_rules.sh
real 0m50.780s
user 0m23.556s
sys 4m13.032s
With this patch:
# time sh insert_tc_rules.sh
real 0m17.718s
user 0m7.807s
sys 3m45.050s
Fixes: 047f340b36 ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After 1b23cdbd2b ("net: protect netdev->napi_list with netdev_lock()")
it makes sense to iterate through dev->napi_list while holding
the device lock.
Also call synchronize_net() at most one time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250117232113.1612899-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Basically, dev_ifsioc() operates on the passed single netns (except
for netdev notifier chains with lower/upper devices for which we will
need more changes).
Let's hold rtnl_net_lock() for dev_ifsioc().
Now that NETDEV_CHANGENAME is always triggered under rtnl_net_lock()
of the device's netns. (do_setlink() and dev_ifsioc())
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
devnet_rename_sem is no longer used since commit
0840556e5a ("net: Protect dev->name by seqlock.").
Also, RTNL serialises dev_change_name().
Let's remove devnet_rename_sem.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit forgot to add netdev_rename_lock in one of the
error paths in dev_change_name().
Let's hold netdev_rename_lock before restoring the old dev->name.
Fixes: 0840556e5a ("net: Protect dev->name by seqlock.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250115095545.52709-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet says:
====================
net: reduce RTNL pressure in unregister_netdevice()
One major source of RTNL contention resides in unregister_netdevice()
Due to RCU protection of various network structures, and
unregister_netdevice() being a synchronous function,
it is calling potentially slow functions while holding RTNL.
I think we can release RTNL in two points, so that three
slow functions are called while RTNL can be used
by other threads.
v1: https://lore.kernel.org/netdev/20250107130906.098fc8d6@kernel.org/T/#m398c95f5778e1ff70938e079d3c4c43c050ad2a6
====================
Link: https://patch.msgid.link/20250114205531.967841-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
One synchronize_net() call is currently done while holding RTNL.
This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.
For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().
This should be safe, because devices are no longer visible
to other threads after unlist_netdevice() call
and setting dev->reg_state to NETREG_UNREGISTERING.
In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two synchronize_net() calls are currently done while holding RTNL.
This is source of RTNL contention in workloads adding and deleting
many network namespaces per second, because synchronize_rcu()
and synchronize_rcu_expedited() can use 60+ ms in some cases.
For cleanup_net() use, temporarily release RTNL
while calling the last synchronize_net().
This should be safe, because devices are no longer visible
to other threads at this point.
In any case, the new netdev_lock() / netdev_unlock()
infrastructure that we are adding should allow
to fix potential issues, with a combination
of a per-device mutex and dev->reg_state awareness.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
flush_all_backlogs() is called from unregister_netdevice_many_notify()
as part of netdevice dismantles.
This is currently called under RTNL, and can last up to 50 ms
on busy hosts.
There is no reason to hold RTNL at this stage, if our caller
is cleanup_net() : netns are no more visible, devices
are in NETREG_UNREGISTERING state and no other thread
could mess our state while RTNL is temporarily released.
In order to provide isolation, this patch provides a separate
'net_todo_list' for cleanup_net().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
flush_all_backlogs() uses per-cpu and static data to hold its
temporary data, on the assumption it is called under RTNL
protection.
Following patch in the series will break this assumption.
Use instead a dynamically allocated piece of memory.
In the unlikely case the allocation fails,
use a boot-time allocated memory.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cleanup_net() is the single thread responsible
for netns dismantles, and a serious bottleneck.
Before we can get per-netns RTNL, make sure
all synchronize_net() called from this thread
are using rcu_synchronize_expedited().
v3: deal with CONFIG_NET_NS=n
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250114205531.967841-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take netdev_lock() in netif_napi_set_irq(). All NAPI "control fields"
are now protected by that lock (most of the other ones are set during
napi add/del). The napi_hash_node is fully protected by the hash
spin lock, but close enough for the kdoc...
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that NAPI instances can't come and go without holding
netdev->lock we can trivially switch from rtnl_lock() to
netdev_lock() for setting netdev->threaded via sysfs.
Note that since we do not lock netdev_lock around sysfs
calls in the core we don't have to "trylock" like we do
with rtnl_lock.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In prep for dropping rtnl_lock, start locking netdev->lock in netlink
genl ops. We need to be using netdev->up instead of flags & IFF_UP.
We can remove the RCU lock protection for the NAPI since NAPI list
is protected by netdev->lock already.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wrap napi_enable() / napi_disable() with netdev_lock().
Provide the "already locked" flavor of the API.
iavf needs the usual adjustment. A number of drivers call
napi_enable() under a spin lock, so they have to be modified
to take netdev_lock() first, then spin lock then call
napi_enable_locked().
Protecting napi_enable() implies that napi->napi_id is protected
by netdev_lock().
Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hold netdev->lock when NAPIs are getting added or removed.
This will allow safe access to NAPI instances of a net_device
without rtnl_lock.
Create a family of helpers which assume the lock is already taken.
Switch iavf to them, as it makes extensive use of netdev->lock,
already.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some uAPI (netdev netlink) hide net_device's sub-objects while
the interface is down to ensure uniform behavior across drivers.
To remove the rtnl_lock dependency from those uAPIs we need a way
to safely tell if the device is down or up.
Add an indication of whether device is open or closed, protected
by netdev->lock. The semantics are the same as IFF_UP, but taking
netdev_lock around every write to ->flags would be a lot of code
churn.
We don't want to blanket the entire open / close path by netdev_lock,
because it will prevent us from applying it to specific structures -
core helpers won't be able to take that lock from any function
called by the drivers on open/close paths.
So the state of the flag is "pessimistic", as in it may report false
negatives, but never false positives.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add helpers for accessing netdevs under netdev_lock().
There's some careful handling needed to find the device and lock it
safely, without it getting unregistered, and without taking rtnl_lock
(the latter being the whole point of the new locking, after all).
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Protect writes to netdev->reg_state with netdev_lock().
From now on holding netdev_lock() is sufficient to prevent
the net_device from getting unregistered, so code which
wants to hold just a single netdev around no longer needs
to hold rtnl_lock.
We do not protect the NETREG_UNREGISTERED -> NETREG_RELEASED
transition. We'd need to move mutex_destroy(netdev->lock)
to .release, but the real reason is that trying to stop
the unregistration process mid-way would be unsafe / crazy.
Taking references on such devices is not safe, either.
So the intended semantics are to lock REGISTERED devices.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a single buffer XDP is attached, NIC should guarantee only single
page packets will be received.
tcp-data-split feature splits packets into header and payload. single
buffer XDP can't handle it properly.
So attaching single buffer XDP should be disallowed when tcp-data-split
is enabled.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-6-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When tcp-data-split is UNKNOWN mode, drivers arbitrarily handle it.
For example, bnxt_en driver automatically enables if at least one of
LRO/GRO/JUMBO is enabled.
If tcp-data-split is UNKNOWN and LRO is enabled, a driver returns
ENABLES of tcp-data-split, not UNKNOWN.
So, `ethtool -g eth0` shows tcp-data-split is enabled.
The problem is in the setting situation.
In the ethnl_set_rings(), it first calls get_ringparam() to get the
current driver's config.
At that moment, if driver's tcp-data-split config is UNKNOWN, it returns
ENABLE if LRO/GRO/JUMBO is enabled.
Then, it sets values from the user and driver's current config to
kernel_ethtool_ringparam.
Last it calls .set_ringparam().
The driver, especially bnxt_en driver receives
ETHTOOL_TCP_DATA_SPLIT_ENABLED.
But it can't distinguish whether it is set by the user or just the
current config.
When user updates ring parameter, the new hds_config value is updated
and current hds_config value is stored to old_hdsconfig.
Driver's .set_ringparam() callback can distinguish a passed
tcp-data-split value is came from user explicitly.
If .set_ringparam() is failed, hds_config is rollbacked immediately.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://patch.msgid.link/20250114142852.3364986-2-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
init_dummy_netdev_core() used to cater to net_devices which
did not come from alloc_netdev_mqs(). Since that's no longer
supported remove the init logic which duplicates alloc_netdev_mqs().
While at it rename back to init_dummy_netdev().
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
init_dummy_netdev() can initialize statically declared or embedded
net_devices. Such netdevs did not come from alloc_netdev_mqs().
After recent work by Breno, there are the only two cases where
we have do that.
Switch those cases to alloc_netdev_mqs() and delete init_dummy_netdev().
Dealing with static netdevs is not worth the maintenance burden.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250113003456.3904110-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are no module callers of dev_get_by_napi_id(),
and commit d1cacd7477 ("netdev: prevent accessing NAPI instances
from another namespace") proves that getting NAPI by id
needs to be done with care. So hide dev_get_by_napi_id().
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250110004924.3212260-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netlink code depends on NAPI instances being sorted by ID on
the netdev list for dump continuation. We need to be able to
find the position on the list where we left off if dump does
not fit in a single skb, and in the meantime NAPI instances
can come and go.
This was trivially true when we were assigning a new ID to every
new NAPI instance. Since we added the NAPI config API, we try
to retain the ID previously used for the same queue, but still
add the new NAPI instance at the start of the list.
This is fine if we reset the entire netdev and all NAPIs get
removed and added back. If driver replaces a NAPI instance
during an operation like DEVMEM queue reset, or recreates
a subset of NAPI instances in other ways we may end up with
broken ordering, and therefore Netlink dumps with either
missing or duplicated entries.
At this stage the problem is theoretical. Only two drivers
support queue API, bnxt and gve. gve recreates NAPIs during
queue reset, but it doesn't support NAPI config.
bnxt supports NAPI config but doesn't recreate instances
during reset.
We need to save the ID in the config as soon as it is assigned
because otherwise the new NAPI will not know what ID it will
get at enable time, at the time it is being added.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In commit 66e4c8d950 ("net: warn if transport header was not set")
I added a debug check in skb_transport_header() to detect
if a caller expects the transport_header to be set to a meaningful
value by a prior code path.
Unfortunately, __netif_receive_skb_core() resets the transport header
to the same value than the network header, defeating this check
in receive paths.
Pretending the transport and network headers are the same
is usually wrong.
This patch removes this reset for CONFIG_DEBUG_NET=y builds
to let fuzzers and CI find bugs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144342.499759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The NAPI IDs were not fully exposed to user space prior to the netlink
API, so they were never namespaced. The netlink API must ensure that
at the very least NAPI instance belongs to the same netns as the owner
of the genl sock.
napi_by_id() can become static now, but it needs to move because of
dev_get_by_napi_id().
Cc: stable@vger.kernel.org
Fixes: 1287c1ae0f ("netdev-genl: Support setting per-NAPI config values")
Fixes: 27f91aaf49 ("netdev-genl: Add netlink framework functions for napi")
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250106180137.1861472-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(un)?register_netdevice_notifier_dev_net() hold RTNL before triggering
the notifier for all netdev in the netns.
Let's convert the RTNL to rtnl_net_lock().
Note that move_netdevice_notifiers_dev_net() is assumed to be (but not
yet) protected by per-netns RTNL of both src and dst netns; we need to
convert wireless and hyperv drivers that call dev_change_net_namespace().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(un)?register_netdevice_notifier_net() hold RTNL before triggering the
notifier for all netdev in the netns.
Let's convert the RTNL to rtnl_net_lock().
Note that the per-netns netdev notifier is protected by per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit d7811e623d ("[NET]: Drop tx lock in dev_watchdog_up")
dev_watchdog_up() became a simple wrapper for __netdev_watchdog_up()
Herbert also said : "In 2.6.19 we can eliminate the unnecessary
__dev_watchdog_up and replace it with dev_watchdog_up."
This patch consolidates things to have only two functions, with
a common prefix.
- netdev_watchdog_up(), exported for the sake of one freescale driver.
This replaces __netdev_watchdog_up() and dev_watchdog_up().
- netdev_watchdog_down(), static to net/sched/sch_generic.c
This replaces dev_watchdog_down().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250105090924.1661822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let's hold per-netns RTNL of dev_net(dev) in register_netdev()
and unregister_netdev().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The blamed commit disabled hardware offoad of IPv6 packets with
extension headers on devices that advertise NETIF_F_IPV6_CSUM,
based on the definition of that feature in skbuff.h:
* * - %NETIF_F_IPV6_CSUM
* - Driver (device) is only able to checksum plain
* TCP or UDP packets over IPv6. These are specifically
* unencapsulated packets of the form IPv6|TCP or
* IPv6|UDP where the Next Header field in the IPv6
* header is either TCP or UDP. IPv6 extension headers
* are not supported with this feature. This feature
* cannot be set in features for a device with
* NETIF_F_HW_CSUM also set. This feature is being
* DEPRECATED (see below).
The change causes skb_warn_bad_offload to fire for BIG TCP
packets.
[ 496.310233] WARNING: CPU: 13 PID: 23472 at net/core/dev.c:3129 skb_warn_bad_offload+0xc4/0xe0
[ 496.310297] ? skb_warn_bad_offload+0xc4/0xe0
[ 496.310300] skb_checksum_help+0x129/0x1f0
[ 496.310303] skb_csum_hwoffload_help+0x150/0x1b0
[ 496.310306] validate_xmit_skb+0x159/0x270
[ 496.310309] validate_xmit_skb_list+0x41/0x70
[ 496.310312] sch_direct_xmit+0x5c/0x250
[ 496.310317] __qdisc_run+0x388/0x620
BIG TCP introduced an IPV6_TLV_JUMBO IPv6 extension header to
communicate packet length, as this is an IPv6 jumbogram. But, the
feature is only enabled on devices that support BIG TCP TSO. The
header is only present for PF_PACKET taps like tcpdump, and not
transmitted by physical devices.
For this specific case of extension headers that are not
transmitted, return to the situation before the blamed commit
and support hardware offload.
ipv6_has_hopopt_jumbo() tests not only whether this header is present,
but also that it is the only extension header before a terminal (L4)
header.
Fixes: 04c20a9356 ("net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Closes: https://lore.kernel.org/netdev/CANn89iK1hdC3Nt8KPhOtTF8vCPc1AHDCtse_BTNki1pWxAByTQ@mail.gmail.com/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250101164909.1331680-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To make the system page pool usable as a source for allocating XDP
frames, we need to register it with xdp_reg_mem_model(), so that page
return works correctly. This is done in preparation for using the system
page_pool to convert XDP_PASS XSk frames to skbs; for the same reason,
make the per-cpu variable non-static so we can access it from other
source files as well (but w/o exporting).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241203173733.3181246-7-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In lots of places, bpf_prog pointer is used only for tracing or other
stuff that doesn't modify the structure itself. Same for net_device.
Address at least some of them and add `const` attributes there. The
object code didn't change, but that may prevent unwanted data
modifications and also allow more helpers to have const arguments.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>