The recent removal of ksize() from alloc_skb() increased performance
because we no longer read the associated struct page.
We have an equivalent cost at kfree_skb() time: kfree(skb->head) has to
access a struct page, often cold in CPU caches, to get the owning
struct kmem_cache.
Considering that many allocations are small (at least for TCP ones),
we can have our own kmem_cache to avoid the cache line miss.
This also saves memory because these small heads
are no longer padded to 1024 bytes.
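As a rough sketch of the idea (the cache name, threshold and init hook below
are assumptions for illustration, not necessarily the exact identifiers used
by this patch), small heads come from a dedicated slab cache, so freeing them
never has to look up the owning kmem_cache through struct page:

#define SKB_SMALL_HEAD_SIZE	1024	/* assumed threshold for "small" heads */

static struct kmem_cache *skb_small_head_cache __ro_after_init;

void __init skb_small_head_cache_init(void)
{
	skb_small_head_cache = kmem_cache_create("skbuff_small_head",
						 SKB_SMALL_HEAD_SIZE, 0,
						 SLAB_HWCACHE_ALIGN, NULL);
}

static void *skb_head_alloc(unsigned int size, gfp_t gfp)
{
	/* Small heads: dedicated cache, no struct page lookup at free time. */
	if (size <= SKB_SMALL_HEAD_SIZE)
		return kmem_cache_alloc(skb_small_head_cache, gfp);
	return kmalloc(size, gfp);
}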
CONFIG_SLUB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head 2907 2907 640 51 8 : tunables 0 0 0 : slabdata 57 57 0
CONFIG_SLAB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head 607 624 640 6 1 : tunables 54 27 8 : slabdata 104 104 5
Notes:
- After Kees Cook's patches and this one, we might be able to revert
commit dbae2b0628 ("net: skb: introduce and use a single page frag cache"),
because GRO_MAX_HEAD is also small.
- This patch is a NOP for CONFIG_SLOB=y builds.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All kmalloc_reserve() callers have to make the same computation;
factor it out to prepare for the following patch in the series.
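A hedged sketch of the factorization (the exact signature change is an
assumption): the alignment that every caller used to perform moves into
kmalloc_reserve() itself, which can then also report the effective size back:

/* Before: every caller open-coded the same computation. */
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);

/* After (sketch): kmalloc_reserve() aligns the size internally and
 * returns the size actually used through the pointer.
 */
data = kmalloc_reserve(&size, gfp_mask, node, &pfmemalloc);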
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a cleanup patch to prepare for the following change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have many places using this expression:
SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
Use of SKB_HEAD_ALIGN() will allow us to clean them up.
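A plausible definition of the helper (sketch; the actual macro may differ in
detail):

/* Align the skb head size and reserve room for struct skb_shared_info. */
#define SKB_HEAD_ALIGN(X) (SKB_DATA_ALIGN(X) + \
			   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))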
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2023-02-06
this is a pull request of 47 patches for net-next/master.
The first two patches are by Oliver Hartkopp. One adds missing error
checking to the CAN_GW protocol, the other adds a missing CAN address
family check to the CAN ISO TP protocol.
Thomas Kopp contributes a performance optimization to the mcp251xfd
driver.
The next 11 patches are by Geert Uytterhoeven and add support for
R-Car V4H systems to the rcar_canfd driver.
Stephane Grosjean and Lukas Magel contribute 8 patches to the peak_usb
driver, which add support for configurable CAN channel ID.
The last 17 patches are by me and target the CAN bit timing
configuration. The bit timing is cleaned up, error messages are
improved and forwarded to user space via NL_SET_ERR_MSG_FMT() instead
of netdev_err(), and the SJW handling is updated, including the
definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.
* tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (47 commits)
can: bittiming: can_validate_bitrate(): report error via netlink
can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: bittiming: can_calc_bittiming(): clean up SJW handling
can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
can: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffer Segment
can: bittiming: can_sjw_check(): report error via netlink and harmonize error value
can: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error value
can: bittiming: factor out can_sjw_set_default() and can_sjw_check()
can: bittiming: can_changelink() pass extack down callstack
can: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: netlink: can_validate(): validate sample point for CAN and CAN-FD
can: dev: register_candev(): bail out if both fixed bit rates and bit timing constants are provided
can: dev: register_candev(): ensure that bittiming const are valid
can: bittiming: can_get_bittiming(): use direct return and remove unneeded else
can: bittiming: can_fixup_bittiming(): set effective tq
can: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1
can: bittiming(): replace open coded variants of can_bit_time()
can: peak_usb: Reorder include directives alphabetically
can: peak_usb: align CAN channel ID format in log with sysfs attribute
can: peak_usb: export PCAN CAN channel ID as sysfs device attribute
...
====================
Link: https://lore.kernel.org/r/20230206131620.2758724-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If ops->get_mm() returns a non-zero error code, we goto out_complete,
but there, we return 0. Fix that to propagate the "ret" variable to the
caller. If ops->get_mm() succeeds, it will always return 0.
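A minimal sketch of the fix (function and struct names here are illustrative,
not the exact ethtool code):

static int mm_prepare_data(struct net_device *dev, struct mm_reply_data *data)
{
	const struct ethtool_ops *ops = dev->ethtool_ops;
	int ret;

	ret = ops->get_mm(dev, &data->state);
	if (ret)
		goto out_complete;
	/* further optional work goes here; on success ret stays 0 */
out_complete:
	ethnl_ops_complete(dev);
	return ret;	/* previously "return 0", which swallowed the error */
}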
Fixes: 2b30f8291a ("net: ethtool: add support for MAC Merge layer")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230206094932.446379-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The ISO 11783-5 standard, in "4.5.2 - Address claim requirements", states:
d) No CF shall begin, or resume, transmission on the network until 250
ms after it has successfully claimed an address except when
responding to a request for address-claimed.
But "Figure 6" and "Figure 7" in "4.5.4.2 - Address-claim
prioritization" show that the CF begins the transmission after 250 ms
from the first AC (address-claimed) message even if it sends another AC
message during that time window to resolve the address contention with
another CF.
As stated in "4.4.2.3 - Address-claimed message":
In order to successfully claim an address, the CF sending an address
claimed message shall not receive a contending claim from another CF
for at least 250 ms.
As stated in "4.4.3.2 - NAME management (NM) message":
1) A commanding CF can
d) request that a CF with a specified NAME transmit the address-
claimed message with its current NAME.
2) A target CF shall
d) send an address-claimed message in response to a request for a
matching NAME
Taking the above arguments into account, the 250 ms wait is requested
only during network initialization.
Do not restart the timer on an AC message if both the NAME and the address
match, i.e. if the address has already been claimed (timer has expired)
or the AC message has been sent to resolve the contention with another
CF (timer is still running).
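A hedged sketch of the intended check (the field and helper names are
best-effort assumptions, not necessarily the actual j1939 code):

	/* Same NAME and same address: the claim is either already settled
	 * (timer has expired) or still being contended (timer is running);
	 * in both cases leave the 250 ms claim timer alone.
	 */
	if (ecu->name == skcb->addr.src_name && ecu->addr == skcb->addr.sa)
		return;

	j1939_ecu_timer_start(ecu);	/* otherwise (re)start the 250 ms wait */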
Signed-off-by: Devid Antonio Filoni <devid.filoni@egluetechnologies.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/all/20221125170418.34575-1-devid.filoni@egluetechnologies.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Currently only the network namespace of the devlink instance is monitored
for port events. If a netdev is moved to a different namespace and then
unregistered, NETDEV_PRE_UNINIT is missed, which triggers the
following WARN_ON in devl_port_unregister().
WARN_ON(devlink_port->type != DEVLINK_PORT_TYPE_NOTSET);
Fix this by changing the netdev notifier from per-net to global so no
event is missed.
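The shape of the change, roughly (exact call sites and field names are an
assumption):

-	err = register_netdevice_notifier_net(devlink_net(devlink),
-					      &devlink->netdevice_nb);
+	err = register_netdevice_notifier(&devlink->netdevice_nb);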
Fixes: 02a68a47ea ("net: devlink: track netdev with devlink_port assigned")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230206094151.2557264-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use the actual CPU count instead of a hardcoded value to decide the size
of 'cpu_used_mask' in 'struct sw_flow'. The reason is as follows.
'struct cpumask cpu_used_mask' is embedded in struct sw_flow.
Its size is hardcoded to CONFIG_NR_CPUS bits, which can be
8192 by default; this costs memory and slows down ovs_flow_alloc().
To address this:
- Redefine cpu_used_mask as a pointer.
- Append cpumask_size() bytes after 'stat' to hold the cpumask.
- Initialize cpu_used_mask right after stats_last_writer.
APIs like cpumask_next() and cpumask_set_cpu() never access bits
beyond the CPU count, so cpumask_size() bytes of memory are enough.
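A sketch of the allocation-time setup under these assumptions (the flow cache
is sized for nr_cpu_ids stats pointers plus cpumask_size() extra bytes):

	flow = kmem_cache_zalloc(flow_cache, GFP_KERNEL);
	if (!flow)
		return ERR_PTR(-ENOMEM);

	flow->stats_last_writer = -1;
	/* the cpumask lives right after the per-CPU stats pointer array */
	flow->cpu_used_mask = (struct cpumask *)&flow->stats[nr_cpu_ids];
	cpumask_set_cpu(0, flow->cpu_used_mask);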
Signed-off-by: Eddy Tao <taoyuan_eddy@hotmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/OS3P286MB229570CCED618B20355D227AF5D59@OS3P286MB2295.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add sock_init_data_uid() to explicitly initialize the socket uid.
To initialize the socket uid, sock_init_data() assumes that the struct
socket *sock is always embedded in a struct socket_alloc, which is used to
access the corresponding inode uid. This may not be true.
Examples are sockets created in tun_chr_open() and tap_open().
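A sketch of how the split could look: sock_init_data() keeps deriving the uid
from the embedding inode, while callers without a socket_alloc (such as
tun/tap) call sock_init_data_uid() with an explicit uid.

void sock_init_data(struct socket *sock, struct sock *sk)
{
	kuid_t uid = sock ? SOCK_INODE(sock)->i_uid :
			    make_kuid(sock_net(sk)->user_ns, 0);

	sock_init_data_uid(sock, sk, uid);
}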
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are 2 classes of in-tree drivers currently:
- those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
holds a bit mask of TXQs
- those who act upon the gate_mask as if it holds a bit mask of TCs
When it comes to the standard, IEEE 802.1Q-2018 does say this in the
second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:
| A gate control list associated with each Port contains an ordered list
| of gate operations. Each gate operation changes the transmission gate
| state for the gate associated with each of the Port's traffic class
| queues and allows associated control operations to be scheduled.
In typically obtuse language, it refers to a "traffic class queue"
rather than a "traffic class" or a "queue". But careful reading of
802.1Q clarifies that "traffic class" and "queue" are in fact
synonymous (see 8.6.6 Queuing frames):
| A queue in this context is not necessarily a single FIFO data structure.
| A queue is a record of all frames of a given traffic class awaiting
| transmission on a given Bridge Port. The structure of this record is not
| specified.
i.o.w. their definition of "queue" isn't the Linux TX queue.
The gate_mask really is input into taprio via its UAPI as a mask of
traffic classes, but taprio_sched_to_offload() converts it into a TXQ
mask.
The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:
- hellcreek, felix, sja1105: these are DSA switches, it's not even very
clear what TXQs correspond to, other than purely software constructs.
Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
So it's fine to convert these to a gate mask per TC.
- enetc: I have the hardware and can confirm that the gate mask is per
TC, and affects all TXQs (BD rings) configured for that priority.
- igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
to be per-TXQ.
- tsnep: Gerhard Engleder clarifies that even though this hardware
supports at most 1 TXQ per TC, the TXQ indices may be different from
the TC values themselves, and it is the TXQ indices that matter to
this hardware. So keep it per-TXQ as well.
- stmmac: I have a GMAC datasheet, and in the EST section it does
specify that the gate events are per TXQ rather than per TC.
- lan966x: again, this is a switch, and while not a DSA one, the way in
which it implements lan966x_mqprio_add() - by only allowing num_tc ==
NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
software construct here as well. They seem to map 1:1 with TCs.
- am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
impression that the fetch_allow variable is treated like a prio_mask.
This definitely sounds closer to a per-TC gate mask rather than a
per-TXQ one, and TI documentation does seem to recommend an identity
mapping between TCs and TXQs. However, Roger Quadros would like to do
some testing before making changes, so I'm leaving this driver to
operate as it did before, for now. Link with more details at the end.
Based on this breakdown, we have 5 drivers with a gate mask per TC and
4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
opt-in and the gate mask per TC the default.
Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add:
query the device driver before calling the proper ndo_setup_tc(), and
figure out whether it expects one format or the other.
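For reference, converting a per-TC gate mask into a per-TXQ one roughly
amounts to the following (the helper name is illustrative):

static u32 tc_gate_mask_to_txq_mask(struct net_device *dev, u32 tc_mask)
{
	u32 txq_mask = 0;
	int tc;

	for (tc = 0; tc < netdev_get_num_tc(dev); tc++) {
		struct netdev_tc_txq txq = dev->tc_to_txq[tc];

		/* open the gates of every TXQ belonging to an open TC */
		if (tc_mask & BIT(tc))
			txq_mask |= GENMASK(txq.offset + txq.count - 1,
					    txq.offset);
	}
	return txq_mask;
}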
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio qdisc does not currently pass the mqprio queue configuration
down to the offloading device driver. So the driver cannot act upon the
TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
assumed that the driver only wants to offload num_tc (see
TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
but there's clearly more to the mqprio configuration than that.
I've considered 2 mechanisms to remedy that. First is to pass a struct
tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
TC_SETUP_QDISC_TAPRIO.
The difference is that in the first case, existing drivers (offloading
or not) all ignore taprio's mqprio portion currently, whereas in the
second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
based on a new capability. The question is which approach would be
better.
I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
on a taprio capability bit) would risk introducing regressions. For
example, taprio doesn't populate (or validate) qopt->hw, nor
mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.
In comparison, adding a capability is functionally equivalent to just
passing the mqprio in a way that drivers can ignore it, except it's
slightly more complicated to use it (need to set the capability).
Ultimately, what made me go for the "mqprio in taprio" variant was that
it's easier for offloading drivers to interpret the mqprio qopt slightly
differently when it comes from taprio vs when it comes from mqprio,
should that ever become necessary.
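With the mqprio portion embedded in the taprio offload structure, a driver
can act upon the per-TC queue counts and offsets, along the lines of this
sketch (the driver name and exact field layout are assumptions):

static void foo_apply_tc_map(struct net_device *dev,
			     const struct tc_taprio_qopt_offload *offload)
{
	int tc;

	for (tc = 0; tc < offload->mqprio.qopt.num_tc; tc++)
		netdev_set_tc_queue(dev, tc,
				    offload->mqprio.qopt.count[tc],
				    offload->mqprio.qopt.offset[tc]);
}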
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
netdev settings once more in a future patch, but this code was already
written twice, once in taprio and once in mqprio.
Refactor the code to a helper in the common mqprio library.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a lot of code in taprio which is "borrowed" from mqprio.
It makes sense to put a stop to the "borrowing" and start actually
reusing code.
Because taprio and mqprio are built as part of different kernel modules,
code reuse can only take place either by writing it as static inline
(limiting), putting it in sch_generic.o (not generic enough), or
creating a third auto-selectable kernel module which only holds library
code. I opted for the third variant.
In a previous change, mqprio gained support for reverse TC:TXQ mappings,
something which taprio still denies. Make taprio use the same validation
logic so that it supports this configuration as well.
The taprio code didn't enforce the TXQ non-overlap rule in txtime-assist
mode, and that looks intentional, even if I've no idea why that might be.
Preserve that behaviour, but add a comment.
There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
update there.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make mqprio more user-friendly, create netlink extended ack messages
which say exactly what is wrong about the queue counts. This uses the
new support for printf-formatted extack messages.
Example:
$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.
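Such a message can be emitted with the printf-style extack helpers, roughly
as follows (exact call site is illustrative):

	NL_SET_ERR_MSG_FMT_MOD(extack,
			       "TC %d queues %d@%d overlap with TC %d queues %d@%d",
			       i, qopt->count[i], qopt->offset[i],
			       j, qopt->count[j], qopt->offset[j]);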
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mqprio_parse_opt() proudly has a comment:
/* If hardware offload is requested we will leave it to the device
* to either populate the queue counts itself or to validate the
* provided queue counts.
*/
Unfortunately some device drivers did not get this memo, and neither
validate the queue counts nor populate them.
In case drivers don't want to populate the queue counts themselves, but
just act upon the requested configuration, it makes sense to introduce a tc
capability and make mqprio query it, so drivers don't have to do the
validation themselves.
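The capability could look roughly like this, queried through TC_QUERY_CAPS
(struct and field names are assumptions):

struct tc_mqprio_caps {
	bool validate_queue_counts:1;
};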
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By imposing that the last TXQ of TC i is smaller than the first TXQ of
any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
the TXQ indices (they must increase as TCs increase).
Claudiu points out that the complexity of the TXQ count validation is
too high for this logic, i.e. instead of iterating over j, it is
sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
will eventually ensure global ordering.
This is true; however, it doesn't appear to me that this is what the code
really intended to do. Instead, based on the comments, it just wanted to
check for overlaps (and this isn't how one does that).
So the following mqprio configuration, which I had recommended to
Vinicius more than once for igb/igc (to account for the fact that on
this hardware, lower numbered TXQs have higher dequeue priority than
higher ones):
num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0
is in fact denied today by mqprio.
The full story is that in fact, it's only denied with "hw 0"; if
hardware offloading is requested, mqprio defers TXQ range overlap
validation to the device driver (a strange decision in itself).
This is most certainly a bug, but it's not one that has any merit for
being fixed on "stable" as far as I can tell. This is because mqprio
always rejected a configuration which was in fact valid, and this has
shaped the way in which mqprio configuration scripts got built for
various hardware (see igb/igc in the link below). Therefore, one could
consider it to be merely an improvement for mqprio to allow reverse
TC:TXQ mappings.
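The overlap check itself reduces to a simple interval test, something along
these lines (sketch):

static bool intervals_overlap(int a, int len_a, int b, int len_b)
{
	int end_a = a + len_a;
	int end_b = b + len_b;

	return a < end_b && b < end_a;
}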
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some more logic will be added to mqprio offloading, so split that code
up from mqprio_init(), which is already large, and create a new
function, mqprio_enable_offload(), similar to taprio_enable_offload().
Also create the opposite function mqprio_disable_offload().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing to a dedicated function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first user of skb_poison_list() is kfree_skb_list_reason(), to catch bugs
earlier, like the one introduced in commit eedade12f4 ("net: kfree_skb_list use
kmem_cache_free_bulk"). For completeness, the mentioned bug has been fixed in
commit f72ff8b81e ("net: fix kfree_skb_list use of skb_mark_not_on_list").
With a bug like the one in the mentioned commit, we would have seen an OOPS
with:
general protection fault, probably for non-canonical address 0xdead000000000870
and the content of one of the registers, e.g. R13: dead000000000800.
In this case skb->len is at offset 112 bytes (0x70), which is why the fault
happens at 0x800 + 0x70 = 0x870.
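For illustration, the poisoning helper boils down to something like the
following (the poison constant name is an assumption); dereferencing the
poisoned pointer then faults at an obviously invalid, non-canonical address:

static inline void skb_poison_list(struct sk_buff *skb)
{
#ifdef CONFIG_DEBUG_NET
	skb->next = SKB_LIST_POISON_NEXT;	/* e.g. 0x800 + POISON_POINTER_DELTA */
#endif
}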
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use BH context only for synchronization, so we don't care if it's
actually serving softirq or not.
As a side note, in the case of threaded NAPI, in_serving_softirq() will
return false because it runs in process context with BH disabled, making
page_pool_recycle_in_cache() unreachable.
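The change is roughly the following (sketch):

-	if (allow_direct && in_serving_softirq() &&
+	if (allow_direct && in_softirq() &&
 	    page_pool_recycle_in_cache(page, pool))
 		return NULL;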
Signed-off-by: Qingfang DENG <qingfang.deng@siflower.com.cn>
Tested-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch added accounting for number of MDB entries per port and
per port-VLAN, and the logic to verify that these values stay within
configured bounds. However it didn't provide means to actually configure
those bounds or read the occupancy. This patch does that.
Two new netlink attributes are added for the MDB occupancy:
IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and
BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy.
And another two for the maximum number of MDB entries:
IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and
BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.
Note that the two new IFLA_BRPORT_ attributes prompt bumping of
RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.
The new attributes are used like this:
# ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
mcast_vlan_snooping 1 mcast_querier 1
# ip link set dev v1 master br
# bridge vlan add dev v1 vid 2
# bridge vlan set dev v1 vid 1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
# bridge link set dev v1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.
# bridge -d link show
5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
[...] mcast_n_groups 1 mcast_max_groups 1
# bridge -d vlan show
port vlan-id
br 1 PVID Egress Untagged
state forwarding mcast_router 1
v1 1 PVID Egress Untagged
[...] mcast_n_groups 1 mcast_max_groups 1
2
[...] mcast_n_groups 0 mcast_max_groups 0
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MDB maintained by the bridge is limited. When the bridge is configured
for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
capacity. In the SW datapath, the capacity is configurable through the
IFLA_BR_MCAST_HASH_MAX parameter, but is ultimately finite. Obviously a
similar limit exists in the HW datapath for the purposes of offloading.
In order to prevent the issue of unilateral exhaustion of MDB resources,
introduce two parameters in each of two contexts:
- Per-port and per-port-VLAN number of MDB entries that the port
is member in.
- Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
per-port-VLAN maximum permitted number of MDB entries, or 0 for
no limit.
The per-port multicast context is used for tracking of MDB entries for the
port as a whole. This is available for all bridges.
The per-port-VLAN multicast context is then only available on
VLAN-filtering bridges on VLANs that have multicast snooping on.
With these changes in place, it will be possible to configure MDB limit for
bridge as a whole, or any one port as a whole, or any single port-VLAN.
Note that unlike the global limit, exhaustion of the per-port and
per-port-VLAN maximums does not cause disablement of multicast snooping.
It is also permitted to configure the local limit larger than hash_max,
even though that is not useful.
In this patch, introduce only the accounting for number of entries, and the
max field itself, but not the means to toggle the max. The next patch
introduces the netlink APIs to toggle and read the values.
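A sketch of the accounting check (field and function names are illustrative,
not the exact bridge code):

static int br_port_ngroups_inc_one(struct net_bridge_mcast_port *pmctx,
				   struct netlink_ext_ack *extack)
{
	u32 max = READ_ONCE(pmctx->mdb_max_entries);
	u32 n = READ_ONCE(pmctx->mdb_n_entries);

	if (max && n >= max) {
		NL_SET_ERR_MSG_FMT_MOD(extack,
				       "Port is already in %u groups, and mcast_max_groups=%u",
				       n, max);
		return -E2BIG;
	}
	WRITE_ONCE(pmctx->mdb_n_entries, n + 1);
	return 0;
}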
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch will add two more maximum MDB allowances to the global
one, mcast_hash_max, that exists today. In all these cases, attempts to add
MDB entries above the configured maximums through netlink fail noisily and
obviously. Such visibility is missing when entries are added through
control plane traffic, by IGMP or MLD packets.
To improve visibility in those cases, add a trace point that reports the
violation, including the relevant netdevice (be it a slave or the bridge
itself), and the MDB entry parameters:
# perf record -e bridge:br_mdb_full &
# [...]
# perf script | cut -d: -f4-
dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-trace-kernel@vger.kernel.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will have more to clean up in the following patches.
Structuring the cleanups in one labeled block will allow reusing the same
cleanup from several places.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cleaning up the effects of br_multicast_new_port_group() just
consists of delisting and freeing the memory, the function
br_mdb_add_group_star_g() inlines the corresponding code. In the following
patches, the number of per-port and per-port-VLAN MDB entries is going to be
maintained, and that counter will have to be updated. Because that logic
is going to be hidden in the br_multicast module, introduce a new hook
intended to again remove a newly-created group.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that br_multicast_new_port_group() takes an extack argument, move
setting the extack there. The downside is that the error messages end
up being less specific (the function cannot distinguish between (S,G)
and (*,G) groups). However, the alternative is to check in the caller
whether the callee set the extack, and if it didn't, set it. But that
is only done when the callee is not exactly known. (E.g. in case of a
notifier invocation.)
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it possible to set an extack in br_multicast_new_port_group().
Eventually, this function will check for per-port and per-port-vlan
MDB maximums, and will use the extack to communicate the reason for
the bounce.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parse any attributes newly added to br_port_policy or vlan_tunnel_policy
strictly, to prevent userspace from passing garbage. Note that this
patchset only touches the former policy. The latter was adjusted for
completeness' sake. There do not appear to be other _deprecated calls
with non-NULL policies.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Entries can linger in the cache without a timer for days, thanks to
the gc_thresh1 limit. As a result, without traffic, the confirmed
time can be outdated and appear to be in the future. Later,
on traffic, NUD_STALE entries can switch to NUD_DELAY and start
the timer, which can see the invalid confirmed time and wrongly
switch to NUD_REACHABLE state instead of NUD_PROBE. As a result,
the timer is set many days in the future. This is more visible on
32-bit platforms, with higher HZ values.
Why is this a problem? While we expect unused entries to expire,
such entries stay in REACHABLE state for too long, locked in
the cache. They are not expired normally, only when the cache is full.
Problem and the wrong state change reported by Zhang Changzhong:
172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.
Fix it by ensuring the timer is started with valid used/confirmed
times. Considering the valid time range is LONG_MAX jiffies,
we try not to go too far into the past while we are in
DELAY/PROBE state. There are also places that need the
used/updated times to be validated while the timer is not running.
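The core of the fix is a clamp of the recorded timestamps before they are
used to arm the timer, along the lines of this sketch:

	unsigned long now = jiffies;

	if (time_after(n->confirmed, now))
		n->confirmed = now;
	if (time_after(n->used, now))
		n->used = now;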
Reported-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's clear that rmbs_lock and sndbufs_lock aim to protect the
rmbs list and the sndbufs list.
During connection establishment, smc_buf_get_slot() will always
be invoked, and it only performs read semantics on the rmbs list and
the sndbufs list.
Based on the above considerations, we replace the mutex with an
rw_semaphore. Only smc_buf_get_slot() uses down_read(), to allow
smc_buf_get_slot() to run concurrently; the other parts use down_write()
to keep exclusive semantics.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when the assigned rmb_desc has not been registered, although it
can be executed in parallel when the assigned rmb_desc is already
registered and only performs read semantics on it. Hence, we cannot simply
replace it with a read semaphore.
The idea here is that if the assigned rmb_desc is already registered,
use the read semaphore to protect the critical section; if the assigned
rmb_desc is not registered, keep using the write semaphore to
preserve its exclusivity.
Thanks to the reusable nature of rmb_desc, this allows us to execute
in parallel in most cases.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following is part of an Off-CPU graph captured during frequent SMC-R
short-lived connection processing:
process_one_work                                        (51.19%)
        smc_close_passive_work                          (28.36%)
                smcr_buf_unuse                          (28.34%)
                        rwsem_down_write_slowpath       (28.22%)
        smc_listen_work                                 (22.83%)
                smc_clc_wait_msg                        (1.84%)
                smc_buf_create                          (20.45%)
                        smcr_buf_map_usable_links
                        rwsem_down_write_slowpath       (20.43%)
                smcr_lgr_reg_rmbs                       (0.53%)
                        rwsem_down_write_slowpath       (0.43%)
                        smc_llc_do_confirm_rkey         (0.08%)
We can clearly see that during connection establishment, connections
wait not on IO, but on llc_conf_mutex.
What is more important, the core critical sections (smcr_buf_unuse() &
smc_buf_create()) only perform read semantics on links, so we can
easily replace the lock with a read semaphore.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llc_conf_mutex was used to protect links and link-related configurations
in the same link group, for example, adding or deleting links. However,
in most cases, the protected critical area has only read semantics and
no write semantics at all, such as obtaining a usable link or an
available rmb_desc.
This patch does simple code refactoring: replace the mutex with an
rw_semaphore, mutex_lock() with down_write(), and mutex_unlock() with
up_write().
Theoretically, this replacement is equivalent, but after this patch,
we can distinguish lock granularity according to different semantics
of critical areas.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some applications seem to rely on RAW sockets.
If they use private netns, we can avoid piling all RAW
sockets bound to a given protocol into a single bucket.
Also place (struct raw_hashinfo).lock into its own
cache line to limit false sharing.
An alternative would be to have per-netns hash tables,
but this seems too expensive for most netns,
where RAW sockets are not used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev selftest callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As all users of the struct devlink_info_req are already in dev.c, move
this struct from devl_internal.c to be local in dev.c.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev flash callbacks, helpers and other related code from
leftover.c to dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev info callbacks, related drivers helpers functions and
other related code from leftover.c to dev.c. No functional change in
this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev eswitch callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev reload callback and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev get and dump callbacks and related dev code to new file
dev.c. This file shall include all callbacks that are specific to the
devlink dev object.
No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that commit 028fb19c6b ("netlink: provide an ability to set
default extack message") provides a weak function that doesn't override
an existing extack message provided by the driver, it makes sense to use
it also for LAG and HSR offloading, not just for bridge offloading.
Also consistently put the message string on a separate line, to reduce
line length from 92 to 84 characters.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230202140354.3158129-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the bvec_set_page helper to initialize bvecs.
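The conversion pattern is roughly (sketch):

-	bvec.bv_page = page;
-	bvec.bv_offset = offset;
-	bvec.bv_len = len;
+	bvec_set_page(&bvec, page, len, offset);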
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both the synchronous early drop algorithm and the asynchronous gc worker
completely ignore connections with the IPS_OFFLOAD_BIT status bit set. With
the new functionality that enabled UDP NEW connection offload in the CT
action, a malicious user can flood the conntrack table with offloaded UDP
connections by sending just a single packet per 5-tuple, because such
connections can no longer be deleted by the early drop algorithm.
To mitigate the issue, allow both early drop and gc to consider offloaded
UDP connections for deletion.
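A sketch of the relaxed skip condition used by early drop / gc (the helper
name is illustrative):

static bool ct_offload_protected(const struct nf_conn *ct)
{
	if (!test_bit(IPS_OFFLOAD_BIT, &ct->status))
		return false;
	/* offloaded UDP entries remain candidates for early drop / gc */
	return nf_ct_protonum(ct) != IPPROTO_UDP;
}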
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>