Make napi_hash_lock IRQ safe. It is used during the control path, and is
taken and released in napi_hash_add and napi_hash_del, which will
typically be called by calls to napi_enable and napi_disable.
This change avoids a deadlock in pcnet32 (and other any other drivers
which follow the same pattern):
CPU 0:
pcnet32_open
spin_lock_irqsave(&lp->lock, ...)
napi_enable
napi_hash_add <- before this executes, CPU 1 proceeds
spin_lock(napi_hash_lock)
[...]
spin_unlock_irqrestore(&lp->lock, flags);
CPU 1:
pcnet32_close
napi_disable
napi_hash_del
spin_lock(napi_hash_lock)
< INTERRUPT >
pcnet32_interrupt
spin_lock(lp->lock) <- DEADLOCK
Changing the napi_hash_lock to be IRQ safe prevents the IRQ from firing
on CPU 1 until napi_hash_lock is released, preventing the deadlock.
Cc: stable@vger.kernel.org
Fixes: 86e25f40aa ("net: napi: Add napi_config")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/netdev/85dd4590-ea6b-427d-876a-1d8559c7ad82@roeck-us.net/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241202182103.363038-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The napi_suspend_irqs routine bootstraps irq suspension by elongating
the defer timeout to irq_suspend_timeout.
The napi_resume_irqs routine effectively cancels irq suspension by
forcing the napi to be scheduled immediately.
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a per-NAPI IRQ suspension parameter, which can be get/set with
netdev-genl.
This patch doesn't change any behavior but prepares the code for other
changes in the following commits which use irq_suspend_timeout as a
timeout for IRQ suspension.
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As documented in skbuff.h, devices with NETIF_F_IPV6_CSUM capability
can only checksum TCP and UDP over IPv6 if the IP header does not
contains extension.
This is enforced for UDP packets emitted from user-space to an IPv6
address as they go through ip6_make_skb(), which calls
__ip6_append_data() where a check is done on the header size before
setting CHECKSUM_PARTIAL.
But the introduction of UDP encapsulation with fou6 added a code-path
where it is possible to get an skb with a partial UDP checksum and an
IPv6 header with extension:
* fou6 adds a UDP header with a partial checksum if the inner packet
does not contains a valid checksum.
* ip6_tunnel adds an IPv6 header with a destination option extension
header if encap_limit is non-zero (the default value is 4).
The thread linked below describes in more details how to reproduce the
problem with GRE-in-UDP tunnel.
Add a check on the network header size in skb_csum_hwoffload_help() to
make sure no IPv6 packet with extension header is handed to a network
device with NETIF_F_IPV6_CSUM capability.
Link: https://lore.kernel.org/netdev/26548921.1r3eYUQgxm@benoit.monin/T/#u
Fixes: aa3463d65e ("fou: Add encap ops for IPv6 tunnels")
Signed-off-by: Benoît Monin <benoit.monin@gmx.fr>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/5fbeecfc311ea182aa1d1c771725ab8b4cac515e.1729778144.git.benoit.monin@gmx.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a persistent NAPI config area for NAPI configuration to the core.
Drivers opt-in to setting the persistent config for a NAPI by passing an
index when calling netif_napi_add_config.
napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
(after the NAPIs are deleted).
Drivers which call netif_napi_add_config will have persistent per-NAPI
settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
Per-NAPI settings are saved in napi_disable and restored in napi_enable.
Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241011184527.16393-6-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Allow per-NAPI gro_flush_timeout setting.
The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.
The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20241011184527.16393-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add defer_hard_irqs to napi_struct in preparation for per-NAPI
settings.
The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device defer_hard_irq field.
Reads from sysfs show the net_device field.
The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20241011184527.16393-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hook into netif_set_real_num_tx_queues() to cleanup any shaper
configured on top of the to-be-destroyed TX queues.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/6da4ee03cae2b2a757d7b59e88baf09cc94c5ef1.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce the basic infrastructure to implement the net-shaper
core functionality. Each network devices carries a net-shaper cache,
the NL get() operation fetches the data from such cache.
The cache is initially empty, will be fill by the set()/group()
operation implemented later and is destroyed at device cleanup time.
The net_shaper_fill_handle(), net_shaper_ctx_init(), and
net_shaper_generic_pre() implementations handle generic index type
attributes, despite the current caller always pass a constant value
to avoid more noise in later patches using them with different
attributes.
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/ddd10fd645a9367803ad02fca4a5664ea5ace170.1728460186.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
One path takes care of SKB_GSO_DODGY, assuming
skb->len is bigger than hdr_len.
virtio_net_hdr_to_skb() does not fully dissect TCP headers,
it only make sure it is at least 20 bytes.
It is possible for an user to provide a malicious 'GSO' packet,
total length of 80 bytes.
- 20 bytes of IPv4 header
- 60 bytes TCP header
- a small gso_size like 8
virtio_net_hdr_to_skb() would declare this packet as a normal
GSO packet, because it would see 40 bytes of payload,
bigger than gso_size.
We need to make detect this case to not underflow
qdisc_skb_cb(skb)->pkt_len.
Fixes: 1def9238d4 ("net_sched: more precise pkt_len computation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 24ab059d2e ("net: check dev->gso_max_size in gso_features_check()")
added a dev->gso_max_size test to gso_features_check() in order to fall
back to GSO when needed.
This was added as it was noticed that some drivers could misbehave if TSO
packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size
limit. For instance, a device could be configured with BIG TCP for IPv4,
but not IPv6.
Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size()
and use the helper to respect both limits before falling back to GSO engine.
Fixes: 24ab059d2e ("net: check dev->gso_max_size in gso_features_check()")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Core:
- Remove a global lock in the affinity setting code
The lock protects a cpumask for intermediate results and the lock
causes a bottleneck on simultaneous start of multiple virtual
machines. Replace the lock and the static cpumask with a per CPU
cpumask which is nicely serialized by raw spinlock held when
executing this code.
- Provide support for giving a suffix to interrupt domain names.
That's required to support devices with subfunctions so that the
domain names are distinct even if they originate from the same
device node.
- The usual set of cleanups and enhancements all over the place
- Drivers:
- Support for longarch AVEC interrupt chip
- Refurbishment of the Armada driver so it can be extended for new
variants.
- The usual set of cleanups and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbn5p8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRFtD/43eB3h5usY2OPW0JmDqrE6qnzsvjPZ
1H52BcmMcOuI6yCfTnbi/fBB52mwSEGq9Dmt1GXradyq9/CJDIqZ1ajI1rA2jzW2
YdbeTDpKm1rS2ddzfp2LT2BryrNt+7etrRO7qHn4EKSuOcNuV2f58WPbIIqasvaK
uPbUDVDPrvXxLNcjoab6SqaKrEoAaHSyKpd0MvDd80wHrtcSC/QouW7JDSUXv699
RwvLebN1OF6mQ2J8Z3DLeCQpcbAs+UT8UvID7kYUJi1g71J/ZY+xpMLoX/gHiDNr
isBtsuEAiZeNaFpksc7A6Jgu5ljZf2/aLCqbPLlHaduHFNmo94x9KUbIF2cpEMN+
rsf5Ff7AVh1otz3cUwLLsm+cFLWRRoZdLuncn7rrgB4Yg0gll7qzyLO6YGvQHr8U
Ocj1RXtvvWsMk4XzhgCt1AH/42cO6go+bhA4HspeYykNpsIldIUl1MeFbO8sWiDJ
kybuwiwHp3oaMLjEK4Lpq65u7Ll8Lju2zRde65YUJN2nbNmJFORrOLmeC1qsr6ri
dpend6n2qD9UD1oAt32ej/uXnG160nm7UKescyxiZNeTm1+ez8GW31hY128ifTY3
4R3urGS38p3gazXBsfw6eqkeKx0kEoDNoQqrO5gBvb8kowYTvoZtkwMGAN9OADwj
w6vvU0i+NIyVMA==
=JlJ2
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Core:
- Remove a global lock in the affinity setting code
The lock protects a cpumask for intermediate results and the lock
causes a bottleneck on simultaneous start of multiple virtual
machines. Replace the lock and the static cpumask with a per CPU
cpumask which is nicely serialized by raw spinlock held when
executing this code.
- Provide support for giving a suffix to interrupt domain names.
That's required to support devices with subfunctions so that the
domain names are distinct even if they originate from the same
device node.
- The usual set of cleanups and enhancements all over the place
Drivers:
- Support for longarch AVEC interrupt chip
- Refurbishment of the Armada driver so it can be extended for new
variants.
- The usual set of cleanups and enhancements all over the place"
* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
genirq: Use cpumask_intersects()
genirq/cpuhotplug: Use cpumask_intersects()
irqchip/apple-aic: Only access system registers on SoCs which provide them
irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
dt-bindings: apple,aic: Document A7-A11 compatibles
irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
genirq/msi: Use kmemdup_array() instead of kmemdup()
genirq/proc: Change the return value for set affinity permission error
genirq/proc: Use irq_move_pending() in show_irq_affinity()
genirq/proc: Correctly set file permissions for affinity control files
genirq: Get rid of global lock in irq_do_set_affinity()
genirq: Fix typo in struct comment
irqchip/loongarch-avec: Add AVEC irqchip support
irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
LoongArch: Architectural preparation for AVEC irqchip
LoongArch: Move irqchip function prototypes to irq-loongson.h
irqchip/loongson-pch-msi: Switch to MSI parent domains
softirq: Remove unused 'action' parameter from action callback
...
For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and unaccessible to the host. We expect there to be no mixing and
matching of device memory frags (unaccessible) with host memory frags
(accessible) in the same skb.
Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.
__skb_fill_netmem_desc() now checks frags added to skbs for net_iov,
and marks the skb as skb->devmem accordingly.
Add checks through the network stack to avoid accessing the frags of
devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-9-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-8-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdev_dmabuf_binding struct which represents the
dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
rx queues on the netdevice. On the binding, the dma_buf_attach
& dma_buf_map_attachment will occur. The entries in the sg_table from
mapping will be inserted into a genpool to make it ready
for allocation.
The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
holds the dma-buf offset of the base of the chunk and the dma_addr of
the chunk. Both are needed to use allocations that come from this chunk.
We create a new type that represents an allocation from the genpool:
net_iov. We setup the net_iov allocation size in the
genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
allocated by the page pool and given to the drivers.
The user can unbind the dmabuf from the netdevice by closing the netlink
socket that established the binding. We do this so that the binding is
automatically unbound even if the userspace process crashes.
The binding and unbinding leaves an indicator in struct netdev_rx_queue
that the given queue is bound, and the binding is actuated by resetting
the rx queue using the queue API.
The netdev_dmabuf_binding struct is refcounted, and releases its
resources only when all the refs are released.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # excluding netlink
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev_pick_tx_cpu_id() has been introduced with two users by
commit a4ea8a3dac ("net: Add generic ndo_select_queue functions").
The use in AF_PACKET has been removed in 2019 by
commit b71b5837f8 ("packet: rework packet_pick_tx_queue() to use common code selection")
The other user was a Netlogic XLP driver, removed in 2021 by
commit 47ac6f567c ("staging: Remove Netlogic XLP network driver").
It's relatively unlikely that any modern driver will need an
.ndo_select_queue implementation which picks purely based on CPU ID
and skips XPS, delete dev_pick_tx_cpu_id()
Found by code inspection.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240906161059.715546-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD(). Here we can simplify
the code.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20240904093243.3345012-6-lihongbo22@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"Interface can't change network namespaces" is rather an attribute,
not a feature, and it can't be changed via Ethtool.
Make it a "cold" private flag instead of a netdev_feature and free
one more bit.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Make dev->priv_flags `u32` back and define bits higher than 31 as
bitfield booleans as per Jakub's suggestion. This simplifies code
which accesses these bits with no optimization loss (testb both
before/after), allows to not extend &netdev_priv_flags each time,
but also scales better as bits > 63 in the future would only add
a new u64 to the structure with no complications, comparing to
that extending ::priv_flags would require converting it to a bitmap.
Note that I picked `unsigned long :1` to not lose any potential
optimizations comparing to `bool :1` etc.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The deprecated helper strcpy() performs no bounds checking on the
destination buffer. This could result in linear overflows beyond
the end of the buffer, leading to all kinds of misbehaviors.
The safe replacement is strscpy() [1].
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1]
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbHtrsACgkQ1V2XiooU
IOTZ8g/+OW8W468NmBHA2zrTWei19irA1iCBLvXMPakice0+ADU51eqVp6uUrCeP
iBUZGMtCq4WzFBrAEBePK3UNxxMHquWvAsA0kO/XW95KVM++s9ykF62q89jugMb3
CADEv/TxgJrkzpLWclxHNTCWMKpURijlkjT+kCMR4fKbeQnB6e/jI+2sdl7l5iRG
tHHm8ieewNNKE+jlSUJUrPEIM3tXRaZh9+JmbClfsF6wUw7qLmT/7P92aHBX4Owp
tpqi/Xc5/2k+Ud96a8u1NYrLLG+L70uz3SaeE7PvhaRavFuYftk2XLB4L2umtEfb
ZCZO/lCadH23XrVAUs5EtCDk4Tu3rZdTDsKYm2qS66uBsh/e6hg+j/cIPSO8jsNq
5Zbs/XzPFJ1PUpXVy8Sfs9vxH+cDuiqhy9nfKrbQotsqtoW+z52UoFH4WAjfmpqb
XMI+yeSTXYl1KIo2LV408VFRFRGcstBvXE7bOn7ufSrltRZcFdx7wqQBgVbh1zvA
1NTzIguZ+Lf2hPcNLPQd/f2vghKRTI7gUwzlDRw6so6NOWUM6/yV5KyoiZKekHjC
S6+M8cdiyMH8DmSsvAb46YKtDxYuIHqLVxuVqjfBHrMo1hLIo5smMCeCRA1Vabd5
/E4DTwpN5tVWX+HZl1wcAtQpXhcTktWM0qGSPnlRS11gwbAGWvk=
=O8tu
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
Patch #1 fix checksum calculation in nfnetlink_queue with SCTP,
segment GSO packet since skb_zerocopy() does not support
GSO_BY_FRAGS, from Antonio Ojea.
Patch #2 extend nfnetlink_queue coverage to handle SCTP packets,
from Antonio Ojea.
Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
from Donald Hunter.
Patch #4 adds a dedicate commit list for sets to speed up
intra-transaction lookups, from Florian Westphal.
Patch #5 skips removal of element from abort path for the pipapo
backend, ditching the shadow copy of this datastructure
is sufficient.
Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
let users of conncoiunt decide when to enable conntrack,
this is needed by openvswitch, from Xin Long.
Patch #7 pass context to all nft_parse_register_load() in
preparation for the next patch.
Patches #8 and #9 reject loads from uninitialized registers from
control plane to remove register initialization from
datapath. From Florian Westphal.
* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: don't initialize registers in nft_do_chain()
netfilter: nf_tables: allow loads only when register is initialized
netfilter: nf_tables: pass context structure to nft_parse_register_load
netfilter: move nf_ct_netns_get out of nf_conncount_init
netfilter: nf_tables: do not remove elements if set backend implements .abort
netfilter: nf_tables: store new sets in dedicated list
netfilter: nfnetlink: convert kfree_skb to consume_skb
selftests: netfilter: nft_queue.sh: sctp coverage
netfilter: nfnetlink_queue: unbreak SCTP traffic
====================
Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When net devices propagate xdp configurations to slave devices,
we will need to perform a memory provider check to ensure we're
not binding xdp to a device using unreadable netmem.
Currently the ->ndo_bpf calls in a few places. Adding checks to all
these places would not be ideal.
Refactor all the ->ndo_bpf calls into one place where we can add this
check in the future.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Link topologies containing multiple network PHYs attached to the same
net_device can be found when using a PHY as a media converter for use
with an SFP connector, on which an SFP transceiver containing a PHY can
be used.
With the current model, the transceiver's PHY can't be used for
operations such as cable testing, timestamping, macsec offload, etc.
The reason being that most of the logic for these configuration, coming
from either ethtool netlink or ioctls tend to use netdev->phydev, which
in multi-phy systems will reference the PHY closest to the MAC.
Introduce a numbering scheme allowing to enumerate PHY devices that
belong to any netdev, which can in turn allow userspace to take more
precise decisions with regard to each PHY's configuration.
The numbering is maintained per-netdev, in a phy_device_list.
The numbering works similarly to a netdevice's ifindex, with
identifiers that are only recycled once INT_MAX has been reached.
This prevents races that could occur between PHY listing and SFP
transceiver removal/insertion.
The identifiers are assigned at phy_attach time, as the numbering
depends on the netdevice the phy is attached to. The PHY index can be
re-used for PHYs that are persistent.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
When soft interrupt actions are called, they are passed a pointer to the
struct softirq action which contains the action's function pointer.
This pointer isn't useful, as the action callback already knows what
function it is. And since each callback handles a specific soft interrupt,
the callback also knows which soft interrupt number is running.
No soft interrupt action callback actually uses this parameter, so remove
it from the function pointer signature. This clarifies that soft interrupt
actions are global routines and makes it slightly cheaper to call them.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
when packet is enqueued with nfqueue and GSO is enabled, checksum
calculation has to take into account the protocol, as SCTP uses a
32 bits CRC checksum.
Enter skb_gso_segment() path in case of SCTP GSO packets because
skb_zerocopy() does not support for GSO_BY_FRAGS.
Joint work with Pablo.
Signed-off-by: Antonio Ojea <aojea@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add missing __percpu qualifier to a (void *) cast to fix
dev.c:10863:45: warning: cast removes address space '__percpu' of expression
sparse warning. Also remove now unneeded __force sparse directives.
Found by GCC's named address space checks.
There were no changes in the resulting object file.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://patch.msgid.link/20240814070748.943671-1-ubizjak@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP segmentation offload inherently depends on checksum offload. It should
not be possible to disable checksum offload while leaving USO enabled.
Enforce this dependency in code.
There is a single tx-udp-segmentation feature flag to indicate support for
both IPv4/6, hence the devices wishing to support USO must offer checksum
offload for both IP versions.
Fixes: 10154dbded ("udp: Allow GSO transmit from devices with no checksum offload")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-1-f5c5b4149ab9@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In fact, this structure contains a flexible array at the end, but
historically its size, alignment etc., is calculated manually.
There are several instances of the structure embedded into other
structures, but also there's ongoing effort to remove them and we
could in the meantime declare &net_device properly.
Declare the array explicitly, use struct_size() and store the array
size inside the structure, so that __counted_by() can be applied.
Don't use PTR_ALIGN(), as SLUB itself tries its best to ensure the
allocated buffer is aligned to what the user expects.
Also, change its alignment from %NETDEV_ALIGN to the cacheline size
as per several suggestions on the netdev ML.
bloat-o-meter for vmlinux:
free_netdev 445 440 -5
netdev_freemem 24 - -24
alloc_netdev_mqs 1481 1450 -31
On x86_64 with several NICs of different vendors, I was never able to
get a &net_device pointer not aligned to the cacheline size after the
change.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20240710113036.2125584-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While this is not needed to serialise the ethtool entry points (which
are all under RTNL), drivers may have cause to asynchronously access
dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
do this safely without needing to take the RTNL.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/7f9c15eb7525bf87af62c275dde3a8570ee8bf0a.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new API to create/modify/remove RSS contexts, that passes in the
newly-chosen context ID (not as a pointer) rather than leaving the
driver to choose it on create. Also pass in the ctx, allowing drivers
to easily use its private data area to store their hardware-specific
state.
Keep the existing .set_rxfh API for now as a fallback, but deprecate it
for custom contexts (rss_context != 0).
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/45f1fe61df2163c091ec394c9f52000c8b16cc3b.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each context stores the RXFH settings (indir, key, and hfunc) as well
as optionally some driver private data.
Delete any still-existing contexts at netdev unregister time.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/cbd1c402cec38f2e03124f2ab65b4ae4e08bd90d.1719502240.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net_dev->ethtool is a pointer to new struct ethtool_netdev_state, which
currently contains only the wol_enabled field.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/293a562278371de7534ed1eb17531838ca090633.1719502239.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
e3f02f32a0 ("ionic: fix kernel panic due to multi-buffer handling")
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
packet and makes decisions. While doing that, the per-CPU variable
bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
and it may also access other per-CPU variables like xskmap_flush_list.
At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.
The per-CPU variables are only used in the NAPI callback hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protections reasons.
Without locking in local_bh_disable() on PREEMPT_RT this data structure
requires explicit locking.
PREEMPT_RT has forced-threaded interrupts enabled and every
NAPI-callback runs in a thread. If each thread has its own data
structure then locking can be avoided.
Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
it, bpf_net_ctx_clear() removes it again.
The bpf_net_ctx_set() may nest. For instance a function can be used from
within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
NET_TX_SOFTIRQ which does not. Therefore only the first invocations
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero initialized to
ensure nothing is leaked from stack. This is done on first usage of the
struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to
note that initialisation is required. First invocation of
bpf_net_ctx_get_ri() will memset() the data structure and update
bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from memset because it is only used
once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is
moved past nh to exclude it from memset.
The pointer to bpf_net_context is saved task's task_struct. Using
always the bpf_net_context approach has the advantage that there is
almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds.
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
softnet_data::process_queue is a per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
softnet_data::input_queue_head can be updated lockless. This is fine
because this value is only update CPU local by the local backlog_napi
thread.
Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking
of process_queue. This change adds only lockdep coverage and does not
alter the functional behaviour for !PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-11-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The backlog_napi locking (previously RPS) relies on explicit locking if
either RPS or backlog NAPI is enabled. If both are disabled then locking
was achieved by disabling interrupts except on PREEMPT_RT. PREEMPT_RT
was excluded because the needed synchronisation was already provided
local_bh_disable().
Since the introduction of backlog NAPI and making it mandatory for
PREEMPT_RT the ifdef within backlog_lock.*() is obsolete and can be
removed.
Remove the ifdefs in backlog_lock.*().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-10-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
`recursion' counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT leading to a false positive recursion alert.
The `more' member is subject to similar problems if set by one thread
for one driver and wrongly used by another driver within another thread.
Instead of adding a lock to protect the per-CPU variable it is simpler
to make xmit per-task. Sending and receiving skbs happens always
in thread context anyway.
Having a lock to protected the per-CPU counter would block/ serialize two
sending threads needlessly. It would also require a recursive lock to
ensure that the owner can increment the counter further.
Make the softnet_data.xmit a task_struct member on PREEMPT_RT. Add
needed wrapper.
Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.
Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have dev_get_tstats64() for collecting per-cpu stats of
type pcpu_sw_netstats ("tstats"). However, tstats doesn't allow for
accounting tx/rx drops. We do have a stats variant that does have stats
for dropped packets: struct pcpu_dstats, but there are no core helpers
for using those stats.
The VRF driver uses dstats, by providing its own collation/fetch
functions to do so.
This change adds a common implementation for dstats-type collection,
used when pcpu_stat_type == NETDEV_PCPU_STAT_DSTAT. This is based on the
VRF driver's existing stats collator (plus the unused tx_drops stat from
there). We will switch the VRF driver to use this in the next change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240607-dstats-v3-2-cc781fe116f7@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
491aee894a ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391 ("net: use unrcu_pointer() helper")
b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mono_delivery_time was added to check if skb->tstamp has delivery
time in mono clock base (i.e. EDT) otherwise skb->tstamp has
timestamp in ingress and delivery_time at egress.
Renaming the bitfield from mono_delivery_time to tstamp_type is for
extensibilty for other timestamps such as userspace timestamp
(i.e. SO_TXTIME) set via sock opts.
As we are renaming the mono_delivery_time to tstamp_type, it makes
sense to start assigning tstamp_type based on enum defined
in this commit.
Earlier we used bool arg flag to check if the tstamp is mono in
function skb_set_delivery_time, Now the signature of the functions
accepts tstamp_type to distinguish between mono and real time.
Also skb_set_delivery_type_by_clockid is a new function which accepts
clockid to determine the tstamp_type.
In future tstamp_type:1 can be extended to support userspace timestamp
by increasing the bitfield.
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>