The recent removal of ksize() from alloc_skb() increased performance
because we no longer read the associated struct page.
We have an equivalent cost at kfree_skb() time: kfree(skb->head) has to
access a struct page, often cold in CPU caches, to get the owning
struct kmem_cache.
Considering that many allocations are small (at least for TCP ones),
we can have our own kmem_cache to avoid the cache line miss.
This also saves memory because these small heads
are no longer padded to 1024 bytes.
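As a rough sketch of the idea (the cache name, threshold and init hook below
are assumptions for illustration, not necessarily the exact identifiers used
by this patch), small heads come from a dedicated slab cache, so freeing them
never has to look up the owning kmem_cache through struct page:

#define SKB_SMALL_HEAD_SIZE	1024	/* assumed threshold for "small" heads */

static struct kmem_cache *skb_small_head_cache __ro_after_init;

void __init skb_small_head_cache_init(void)
{
	skb_small_head_cache = kmem_cache_create("skbuff_small_head",
						 SKB_SMALL_HEAD_SIZE, 0,
						 SLAB_HWCACHE_ALIGN, NULL);
}

static void *skb_head_alloc(unsigned int size, gfp_t gfp)
{
	/* Small heads: dedicated cache, no struct page lookup at free time. */
	if (size <= SKB_SMALL_HEAD_SIZE)
		return kmem_cache_alloc(skb_small_head_cache, gfp);
	return kmalloc(size, gfp);
}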
CONFIG_SLUB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head 2907 2907 640 51 8 : tunables 0 0 0 : slabdata 57 57 0
CONFIG_SLAB=y
$ grep skbuff_small_head /proc/slabinfo
skbuff_small_head 607 624 640 6 1 : tunables 54 27 8 : slabdata 104 104 5
Notes:
- After Kees Cook's patches and this one, we might be able to revert
commit dbae2b0628 ("net: skb: introduce and use a single page frag cache"),
because GRO_MAX_HEAD is also small.
- This patch is a NOP for CONFIG_SLOB=y builds.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All kmalloc_reserve() callers have to make the same computation;
factor it out to prepare for the following patch in the series.
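A hedged sketch of the factorization (the exact signature change is an
assumption): the alignment that every caller used to perform moves into
kmalloc_reserve() itself, which can then also report the effective size back:

/* Before: every caller open-coded the same computation. */
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);

/* After (sketch): kmalloc_reserve() aligns the size internally and
 * returns the size actually used through the pointer.
 */
data = kmalloc_reserve(&size, gfp_mask, node, &pfmemalloc);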
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a cleanup patch to prepare for the following change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have many places using this expression:
SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
Use of SKB_HEAD_ALIGN() will allow us to clean them up.
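A plausible definition of the helper (sketch; the actual macro may differ in
detail):

/* Align the skb head size and reserve room for struct skb_shared_info. */
#define SKB_HEAD_ALIGN(X) (SKB_DATA_ALIGN(X) + \
			   SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))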
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2023-02-06
this is a pull request of 47 patches for net-next/master.
The first two patches are by Oliver Hartkopp. One adds missing error
checking to the CAN_GW protocol, the other adds a missing CAN address
family check to the CAN ISO TP protocol.
Thomas Kopp contributes a performance optimization to the mcp251xfd
driver.
The next 11 patches are by Geert Uytterhoeven and add support for
R-Car V4H systems to the rcar_canfd driver.
Stephane Grosjean and Lukas Magel contribute 8 patches to the peak_usb
driver, which add support for configurable CAN channel ID.
The last 17 patches are by me and target the CAN bit timing
configuration. The bit timing is cleaned up, error messages are
improved and forwarded to user space via NL_SET_ERR_MSG_FMT() instead
of netdev_err(), and the SJW handling is updated, including the
definition of a new default value that will benefit CAN-FD
controllers, by increasing their oscillator tolerance.
* tag 'linux-can-next-for-6.3-20230206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (47 commits)
can: bittiming: can_validate_bitrate(): report error via netlink
can: bittiming: can_calc_bittiming(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: bittiming: can_calc_bittiming(): clean up SJW handling
can: bittiming: can_sjw_set_default(): use Phase Seg2 / 2 as default for SJW
can: bittiming: can_sjw_check(): check that SJW is not longer than either Phase Buffer Segment
can: bittiming: can_sjw_check(): report error via netlink and harmonize error value
can: bittiming: can_fixup_bittiming(): report error via netlink and harmonize error value
can: bittiming: factor out can_sjw_set_default() and can_sjw_check()
can: bittiming: can_changelink() pass extack down callstack
can: netlink: can_changelink(): convert from netdev_err() to NL_SET_ERR_MSG_FMT()
can: netlink: can_validate(): validate sample point for CAN and CAN-FD
can: dev: register_candev(): bail out if both fixed bit rates and bit timing constants are provided
can: dev: register_candev(): ensure that bittiming const are valid
can: bittiming: can_get_bittiming(): use direct return and remove unneeded else
can: bittiming: can_fixup_bittiming(): set effective tq
can: bittiming: can_fixup_bittiming(): use CAN_SYNC_SEG instead of 1
can: bittiming(): replace open coded variants of can_bit_time()
can: peak_usb: Reorder include directives alphabetically
can: peak_usb: align CAN channel ID format in log with sysfs attribute
can: peak_usb: export PCAN CAN channel ID as sysfs device attribute
...
====================
Link: https://lore.kernel.org/r/20230206131620.2758724-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If ops->get_mm() returns a non-zero error code, we goto out_complete,
but there, we return 0. Fix that to propagate the "ret" variable to the
caller. If ops->get_mm() succeeds, it will always return 0.
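A minimal sketch of the fix (function and struct names here are illustrative,
not the exact ethtool code):

static int mm_prepare_data(struct net_device *dev, struct mm_reply_data *data)
{
	const struct ethtool_ops *ops = dev->ethtool_ops;
	int ret;

	ret = ops->get_mm(dev, &data->state);
	if (ret)
		goto out_complete;
	/* further optional work goes here; on success ret stays 0 */
out_complete:
	ethnl_ops_complete(dev);
	return ret;	/* previously "return 0", which swallowed the error */
}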
Fixes: 2b30f8291a ("net: ethtool: add support for MAC Merge layer")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230206094932.446379-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The ISO 11783-5 standard, in "4.5.2 - Address claim requirements", states:
d) No CF shall begin, or resume, transmission on the network until 250
ms after it has successfully claimed an address except when
responding to a request for address-claimed.
But "Figure 6" and "Figure 7" in "4.5.4.2 - Address-claim
prioritization" show that the CF begins the transmission after 250 ms
from the first AC (address-claimed) message even if it sends another AC
message during that time window to resolve the address contention with
another CF.
As stated in "4.4.2.3 - Address-claimed message":
In order to successfully claim an address, the CF sending an address
claimed message shall not receive a contending claim from another CF
for at least 250 ms.
As stated in "4.4.3.2 - NAME management (NM) message":
1) A commanding CF can
d) request that a CF with a specified NAME transmit the address-
claimed message with its current NAME.
2) A target CF shall
d) send an address-claimed message in response to a request for a
matching NAME
Taking the above arguments into account, the 250 ms wait is requested
only during network initialization.
Do not restart the timer on an AC message if both the NAME and the address
match, i.e. if the address has already been claimed (timer has expired)
or the AC message has been sent to resolve the contention with another
CF (timer is still running).
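A hedged sketch of the intended check (the field and helper names are
best-effort assumptions, not necessarily the actual j1939 code):

	/* Same NAME and same address: the claim is either already settled
	 * (timer has expired) or still being contended (timer is running);
	 * in both cases leave the 250 ms claim timer alone.
	 */
	if (ecu->name == skcb->addr.src_name && ecu->addr == skcb->addr.sa)
		return;

	j1939_ecu_timer_start(ecu);	/* otherwise (re)start the 250 ms wait */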
Signed-off-by: Devid Antonio Filoni <devid.filoni@egluetechnologies.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/all/20221125170418.34575-1-devid.filoni@egluetechnologies.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Currently only the network namespace of the devlink instance is monitored
for port events. If a netdev is moved to a different namespace and then
unregistered, NETDEV_PRE_UNINIT is missed, which triggers the
following WARN_ON in devl_port_unregister().
WARN_ON(devlink_port->type != DEVLINK_PORT_TYPE_NOTSET);
Fix this by changing the netdev notifier from per-net to global so no
event is missed.
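The shape of the change, roughly (exact call sites and field names are an
assumption):

-	err = register_netdevice_notifier_net(devlink_net(devlink),
-					      &devlink->netdevice_nb);
+	err = register_netdevice_notifier(&devlink->netdevice_nb);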
Fixes: 02a68a47ea ("net: devlink: track netdev with devlink_port assigned")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230206094151.2557264-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use the actual CPU count instead of a hardcoded value to decide the size
of 'cpu_used_mask' in 'struct sw_flow'. The reason is as follows.
'struct cpumask cpu_used_mask' is embedded in struct sw_flow.
Its size is hardcoded to CONFIG_NR_CPUS bits, which can be
8192 by default; this costs memory and slows down ovs_flow_alloc().
To address this:
- Redefine cpu_used_mask as a pointer.
- Append cpumask_size() bytes after 'stat' to hold the cpumask.
- Initialize cpu_used_mask right after stats_last_writer.
APIs like cpumask_next() and cpumask_set_cpu() never access bits
beyond the CPU count, so cpumask_size() bytes of memory are enough.
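A sketch of the allocation-time setup under these assumptions (the flow cache
is sized for nr_cpu_ids stats pointers plus cpumask_size() extra bytes):

	flow = kmem_cache_zalloc(flow_cache, GFP_KERNEL);
	if (!flow)
		return ERR_PTR(-ENOMEM);

	flow->stats_last_writer = -1;
	/* the cpumask lives right after the per-CPU stats pointer array */
	flow->cpu_used_mask = (struct cpumask *)&flow->stats[nr_cpu_ids];
	cpumask_set_cpu(0, flow->cpu_used_mask);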
Signed-off-by: Eddy Tao <taoyuan_eddy@hotmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/OS3P286MB229570CCED618B20355D227AF5D59@OS3P286MB2295.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add sock_init_data_uid() to explicitly initialize the socket uid.
To initialize the socket uid, sock_init_data() assumes that the struct
socket *sock is always embedded in a struct socket_alloc, which is used to
access the corresponding inode uid. This may not be true.
Examples are sockets created in tun_chr_open() and tap_open().
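A sketch of how the split could look: sock_init_data() keeps deriving the uid
from the embedding inode, while callers without a socket_alloc (such as
tun/tap) call sock_init_data_uid() with an explicit uid.

void sock_init_data(struct socket *sock, struct sock *sk)
{
	kuid_t uid = sock ? SOCK_INODE(sock)->i_uid :
			    make_kuid(sock_net(sk)->user_ns, 0);

	sock_init_data_uid(sock, sk, uid);
}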
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are 2 classes of in-tree drivers currently:
- those who act upon struct tc_taprio_sched_entry :: gate_mask as if it
holds a bit mask of TXQs
- those who act upon the gate_mask as if it holds a bit mask of TCs
When it comes to the standard, IEEE 802.1Q-2018 does say this in the
second paragraph of section 8.6.8.4 Enhancements for scheduled traffic:
| A gate control list associated with each Port contains an ordered list
| of gate operations. Each gate operation changes the transmission gate
| state for the gate associated with each of the Port's traffic class
| queues and allows associated control operations to be scheduled.
In typically obtuse language, it refers to a "traffic class queue"
rather than a "traffic class" or a "queue". But careful reading of
802.1Q clarifies that "traffic class" and "queue" are in fact
synonymous (see 8.6.6 Queuing frames):
| A queue in this context is not necessarily a single FIFO data structure.
| A queue is a record of all frames of a given traffic class awaiting
| transmission on a given Bridge Port. The structure of this record is not
| specified.
i.o.w. their definition of "queue" isn't the Linux TX queue.
The gate_mask really is input into taprio via its UAPI as a mask of
traffic classes, but taprio_sched_to_offload() converts it into a TXQ
mask.
The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is:
- hellcreek, felix, sja1105: these are DSA switches, it's not even very
clear what TXQs correspond to, other than purely software constructs.
Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense.
So it's fine to convert these to a gate mask per TC.
- enetc: I have the hardware and can confirm that the gate mask is per
TC, and affects all TXQs (BD rings) configured for that priority.
- igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted
to be per-TXQ.
- tsnep: Gerhard Engleder clarifies that even though this hardware
supports at most 1 TXQ per TC, the TXQ indices may be different from
the TC values themselves, and it is the TXQ indices that matter to
this hardware. So keep it per-TXQ as well.
- stmmac: I have a GMAC datasheet, and in the EST section it does
specify that the gate events are per TXQ rather than per TC.
- lan966x: again, this is a switch, and while not a DSA one, the way in
which it implements lan966x_mqprio_add() - by only allowing num_tc ==
NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely
software construct here as well. They seem to map 1:1 with TCs.
- am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the
impression that the fetch_allow variable is treated like a prio_mask.
This definitely sounds closer to a per-TC gate mask rather than a
per-TXQ one, and TI documentation does seem to recommend an identity
mapping between TCs and TXQs. However, Roger Quadros would like to do
some testing before making changes, so I'm leaving this driver to
operate as it did before, for now. Link with more details at the end.
Based on this breakdown, we have 5 drivers with a gate mask per TC and
4 with a gate mask per TXQ. So let's make the gate mask per TXQ the
opt-in and the gate mask per TC the default.
Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add:
query the device driver before calling the proper ndo_setup_tc(), and
figure out whether it expects one format or the other.
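For reference, converting a per-TC gate mask into a per-TXQ one roughly
amounts to the following (the helper name is illustrative):

static u32 tc_gate_mask_to_txq_mask(struct net_device *dev, u32 tc_mask)
{
	u32 txq_mask = 0;
	int tc;

	for (tc = 0; tc < netdev_get_num_tc(dev); tc++) {
		struct netdev_tc_txq txq = dev->tc_to_txq[tc];

		/* open the gates of every TXQ belonging to an open TC */
		if (tc_mask & BIT(tc))
			txq_mask |= GENMASK(txq.offset + txq.count - 1,
					    txq.offset);
	}
	return txq_mask;
}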
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204
Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
Cc: Siddharth Vadapalli <s-vadapalli@ti.com>
Cc: Roger Quadros <rogerq@kernel.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio qdisc does not currently pass the mqprio queue configuration
down to the offloading device driver. So the driver cannot act upon the
TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
assumed that the driver only wants to offload num_tc (see
TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
but there's clearly more to the mqprio configuration than that.
I've considered 2 mechanisms to remedy that. First is to pass a struct
tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
TC_SETUP_QDISC_TAPRIO.
The difference is that in the first case, existing drivers (offloading
or not) all ignore taprio's mqprio portion currently, whereas in the
second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
based on a new capability. The question is which approach would be
better.
I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
on a taprio capability bit) would risk introducing regressions. For
example, taprio doesn't populate (or validate) qopt->hw, nor
mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.
In comparison, adding a capability is functionally equivalent to just
passing the mqprio in a way that drivers can ignore it, except it's
slightly more complicated to use it (need to set the capability).
Ultimately, what made me go for the "mqprio in taprio" variant was that
it's easier for offloading drivers to interpret the mqprio qopt slightly
differently when it comes from taprio vs when it comes from mqprio,
should that ever become necessary.
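With the mqprio portion embedded in the taprio offload structure, a driver
can act upon the per-TC queue counts and offsets, along the lines of this
sketch (the driver name and exact field layout are assumptions):

static void foo_apply_tc_map(struct net_device *dev,
			     const struct tc_taprio_qopt_offload *offload)
{
	int tc;

	for (tc = 0; tc < offload->mqprio.qopt.num_tc; tc++)
		netdev_set_tc_queue(dev, tc,
				    offload->mqprio.qopt.count[tc],
				    offload->mqprio.qopt.offset[tc]);
}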
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
netdev settings once more in a future patch, but this code was already
written twice, once in taprio and once in mqprio.
Refactor the code to a helper in the common mqprio library.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a lot of code in taprio which is "borrowed" from mqprio.
It makes sense to put a stop to the "borrowing" and start actually
reusing code.
Because taprio and mqprio are built as part of different kernel modules,
code reuse can only take place either by writing it as static inline
(limiting), putting it in sch_generic.o (not generic enough), or
creating a third auto-selectable kernel module which only holds library
code. I opted for the third variant.
In a previous change, mqprio gained support for reverse TC:TXQ mappings,
something which taprio still denies. Make taprio use the same validation
logic so that it supports this configuration as well.
The taprio code didn't enforce the TXQ non-overlap rule in txtime-assist
mode, and that looks intentional, even if I've no idea why that might be.
Preserve that behaviour, but add a comment.
There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
update there.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make mqprio more user-friendly, create netlink extended ack messages
which say exactly what is wrong about the queue counts. This uses the
new support for printf-formatted extack messages.
Example:
$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.
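Such a message can be emitted with the printf-style extack helpers, roughly
as follows (exact call site is illustrative):

	NL_SET_ERR_MSG_FMT_MOD(extack,
			       "TC %d queues %d@%d overlap with TC %d queues %d@%d",
			       i, qopt->count[i], qopt->offset[i],
			       j, qopt->count[j], qopt->offset[j]);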
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mqprio_parse_opt() proudly has a comment:
/* If hardware offload is requested we will leave it to the device
* to either populate the queue counts itself or to validate the
* provided queue counts.
*/
Unfortunately some device drivers did not get this memo, and neither
validate the queue counts nor populate them.
In case drivers don't want to populate the queue counts themselves, but
just act upon the requested configuration, it makes sense to introduce a tc
capability and make mqprio query it, so drivers don't have to do the
validation themselves.
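The capability could look roughly like this, queried through TC_QUERY_CAPS
(struct and field names are assumptions):

struct tc_mqprio_caps {
	bool validate_queue_counts:1;
};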
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By imposing that the last TXQ of TC i is smaller than the first TXQ of
any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
the TXQ indices (they must increase as TCs increase).
Claudiu points out that the complexity of the TXQ count validation is
too high for this logic, i.e. instead of iterating over j, it is
sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
will eventually ensure global ordering.
This is true; however, it doesn't appear to me that this is what the code
really intended to do. Instead, based on the comments, it just wanted to
check for overlaps (and this isn't how one does that).
So the following mqprio configuration, which I had recommended to
Vinicius more than once for igb/igc (to account for the fact that on
this hardware, lower numbered TXQs have higher dequeue priority than
higher ones):
num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0
is in fact denied today by mqprio.
The full story is that in fact, it's only denied with "hw 0"; if
hardware offloading is requested, mqprio defers TXQ range overlap
validation to the device driver (a strange decision in itself).
This is most certainly a bug, but it's not one that has any merit for
being fixed on "stable" as far as I can tell. This is because mqprio
always rejected a configuration which was in fact valid, and this has
shaped the way in which mqprio configuration scripts got built for
various hardware (see igb/igc in the link below). Therefore, one could
consider it to be merely an improvement for mqprio to allow reverse
TC:TXQ mappings.
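The overlap check itself reduces to a simple interval test, something along
these lines (sketch):

static bool intervals_overlap(int a, int len_a, int b, int len_b)
{
	int end_a = a + len_a;
	int end_b = b + len_b;

	return a < end_b && b < end_a;
}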
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some more logic will be added to mqprio offloading, so split that code
up from mqprio_init(), which is already large, and create a new
function, mqprio_enable_offload(), similar to taprio_enable_offload().
Also create the opposite function mqprio_disable_offload().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing to a dedicated function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first user of skb_poison_list() is kfree_skb_list_reason(), to catch bugs
earlier, like the one introduced in commit eedade12f4 ("net: kfree_skb_list use
kmem_cache_free_bulk"). For completeness, the mentioned bug has been fixed in
commit f72ff8b81e ("net: fix kfree_skb_list use of skb_mark_not_on_list").
With a bug like the one in the mentioned commit, we would have seen an OOPS
with:
general protection fault, probably for non-canonical address 0xdead000000000870
and the content of one of the registers, e.g. R13: dead000000000800.
In this case skb->len is at offset 112 bytes (0x70), which is why the fault
happens at 0x800 + 0x70 = 0x870.
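For illustration, the poisoning helper boils down to something like the
following (the poison constant name is an assumption); dereferencing the
poisoned pointer then faults at an obviously invalid, non-canonical address:

static inline void skb_poison_list(struct sk_buff *skb)
{
#ifdef CONFIG_DEBUG_NET
	skb->next = SKB_LIST_POISON_NEXT;	/* e.g. 0x800 + POISON_POINTER_DELTA */
#endif
}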
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use BH context only for synchronization, so we don't care if it's
actually serving softirq or not.
As a side note, in the case of threaded NAPI, in_serving_softirq() will
return false because it runs in process context with BH disabled, making
page_pool_recycle_in_cache() unreachable.
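The change is roughly the following (sketch):

-	if (allow_direct && in_serving_softirq() &&
+	if (allow_direct && in_softirq() &&
 	    page_pool_recycle_in_cache(page, pool))
 		return NULL;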
Signed-off-by: Qingfang DENG <qingfang.deng@siflower.com.cn>
Tested-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch added accounting for number of MDB entries per port and
per port-VLAN, and the logic to verify that these values stay within
configured bounds. However it didn't provide means to actually configure
those bounds or read the occupancy. This patch does that.
Two new netlink attributes are added for the MDB occupancy:
IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and
BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy.
And another two for the maximum number of MDB entries:
IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and
BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.
Note that the two new IFLA_BRPORT_ attributes prompt bumping of
RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.
The new attributes are used like this:
# ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
mcast_vlan_snooping 1 mcast_querier 1
# ip link set dev v1 master br
# bridge vlan add dev v1 vid 2
# bridge vlan set dev v1 vid 1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
# bridge link set dev v1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.
# bridge -d link show
5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
[...] mcast_n_groups 1 mcast_max_groups 1
# bridge -d vlan show
port vlan-id
br 1 PVID Egress Untagged
state forwarding mcast_router 1
v1 1 PVID Egress Untagged
[...] mcast_n_groups 1 mcast_max_groups 1
2
[...] mcast_n_groups 0 mcast_max_groups 0
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MDB maintained by the bridge is limited. When the bridge is configured
for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
capacity. In the SW datapath, the capacity is configurable through the
IFLA_BR_MCAST_HASH_MAX parameter, but is ultimately finite. Obviously a
similar limit exists in the HW datapath for the purposes of offloading.
In order to prevent the issue of unilateral exhaustion of MDB resources,
introduce two parameters in each of two contexts:
- Per-port and per-port-VLAN number of MDB entries that the port
is member in.
- Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
per-port-VLAN maximum permitted number of MDB entries, or 0 for
no limit.
The per-port multicast context is used for tracking of MDB entries for the
port as a whole. This is available for all bridges.
The per-port-VLAN multicast context is then only available on
VLAN-filtering bridges on VLANs that have multicast snooping on.
With these changes in place, it will be possible to configure MDB limit for
bridge as a whole, or any one port as a whole, or any single port-VLAN.
Note that unlike the global limit, exhaustion of the per-port and
per-port-VLAN maximums does not cause disablement of multicast snooping.
It is also permitted to configure the local limit larger than hash_max,
even though that is not useful.
In this patch, introduce only the accounting for number of entries, and the
max field itself, but not the means to toggle the max. The next patch
introduces the netlink APIs to toggle and read the values.
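A sketch of the accounting check (field and function names are illustrative,
not the exact bridge code):

static int br_port_ngroups_inc_one(struct net_bridge_mcast_port *pmctx,
				   struct netlink_ext_ack *extack)
{
	u32 max = READ_ONCE(pmctx->mdb_max_entries);
	u32 n = READ_ONCE(pmctx->mdb_n_entries);

	if (max && n >= max) {
		NL_SET_ERR_MSG_FMT_MOD(extack,
				       "Port is already in %u groups, and mcast_max_groups=%u",
				       n, max);
		return -E2BIG;
	}
	WRITE_ONCE(pmctx->mdb_n_entries, n + 1);
	return 0;
}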
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch will add two more maximum MDB allowances to the global
one, mcast_hash_max, that exists today. In all these cases, attempts to add
MDB entries above the configured maximums through netlink fail noisily and
obviously. Such visibility is missing when entries are added through
control plane traffic, by IGMP or MLD packets.
To improve visibility in those cases, add a trace point that reports the
violation, including the relevant netdevice (be it a slave or the bridge
itself), and the MDB entry parameters:
# perf record -e bridge:br_mdb_full &
# [...]
# perf script | cut -d: -f4-
dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-trace-kernel@vger.kernel.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will have more to clean up in the following patches.
Structuring the cleanups in one labeled block will allow reusing the same
cleanup from several places.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cleaning up the effects of br_multicast_new_port_group() just
consists of delisting and freeing the memory, the function
br_mdb_add_group_star_g() inlines the corresponding code. In the following
patches, the number of per-port and per-port-VLAN MDB entries is going to be
maintained, and that counter will have to be updated. Because that logic
is going to be hidden in the br_multicast module, introduce a new hook
intended to again remove a newly-created group.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that br_multicast_new_port_group() takes an extack argument, move
setting the extack there. The downside is that the error messages end
up being less specific (the function cannot distinguish between (S,G)
and (*,G) groups). However, the alternative is to check in the caller
whether the callee set the extack, and if it didn't, set it. But that
is only done when the callee is not exactly known. (E.g. in case of a
notifier invocation.)
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it possible to set an extack in br_multicast_new_port_group().
Eventually, this function will check for per-port and per-port-vlan
MDB maximums, and will use the extack to communicate the reason for
the bounce.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parse any attributes newly added to br_port_policy or vlan_tunnel_policy
strictly, to prevent userspace from passing garbage. Note that this
patchset only touches the former policy. The latter was adjusted for
completeness' sake. There do not appear to be other _deprecated calls
with non-NULL policies.
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Entries can linger in the cache without a timer for days, thanks to
the gc_thresh1 limit. As a result, without traffic, the confirmed
time can be outdated and appear to be in the future. Later,
on traffic, NUD_STALE entries can switch to NUD_DELAY and start
the timer, which can see the invalid confirmed time and wrongly
switch to NUD_REACHABLE state instead of NUD_PROBE. As a result,
the timer is set many days in the future. This is more visible on
32-bit platforms, with higher HZ values.
Why is this a problem? While we expect unused entries to expire,
such entries stay in REACHABLE state for too long, locked in
the cache. They are not expired normally, only when the cache is full.
Problem and the wrong state change reported by Zhang Changzhong:
172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
350520 seconds have elapsed since this entry was last updated, but it is
still in the REACHABLE state (base_reachable_time_ms is 30000),
preventing lladdr from being updated through probe.
Fix it by ensuring the timer is started with valid used/confirmed
times. Considering the valid time range is LONG_MAX jiffies,
we try not to go too far into the past while we are in
DELAY/PROBE state. There are also places that need the
used/updated times to be validated while the timer is not running.
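The core of the fix is a clamp of the recorded timestamps before they are
used to arm the timer, along the lines of this sketch:

	unsigned long now = jiffies;

	if (time_after(n->confirmed, now))
		n->confirmed = now;
	if (time_after(n->used, now))
		n->used = now;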
Reported-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's clear that rmbs_lock and sndbufs_lock aim to protect the
rmbs list and the sndbufs list.
During connection establishment, smc_buf_get_slot() will always
be invoked, and it only performs read semantics on the rmbs list and
the sndbufs list.
Based on the above considerations, we replace the mutex with an
rw_semaphore. Only smc_buf_get_slot() uses down_read(), to allow
smc_buf_get_slot() to run concurrently; the other parts use down_write()
to keep exclusive semantics.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when the assigned rmb_desc has not been registered, although it
can be executed in parallel when the assigned rmb_desc is already
registered and only performs read semantics on it. Hence, we cannot simply
replace it with a read semaphore.
The idea here is that if the assigned rmb_desc is already registered,
use the read semaphore to protect the critical section; if the assigned
rmb_desc is not registered, keep using the write semaphore to
preserve its exclusivity.
Thanks to the reusable nature of rmb_desc, this allows us to execute
in parallel in most cases.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following is part of an Off-CPU graph captured during frequent SMC-R
short-lived connection processing:
process_one_work                                        (51.19%)
        smc_close_passive_work                          (28.36%)
                smcr_buf_unuse                          (28.34%)
                        rwsem_down_write_slowpath       (28.22%)
        smc_listen_work                                 (22.83%)
                smc_clc_wait_msg                        (1.84%)
                smc_buf_create                          (20.45%)
                        smcr_buf_map_usable_links
                        rwsem_down_write_slowpath       (20.43%)
                smcr_lgr_reg_rmbs                       (0.53%)
                        rwsem_down_write_slowpath       (0.43%)
                        smc_llc_do_confirm_rkey         (0.08%)
We can clearly see that during connection establishment, connections
wait not on IO, but on llc_conf_mutex.
What is more important, the core critical sections (smcr_buf_unuse() &
smc_buf_create()) only perform read semantics on links, so we can
easily replace the lock with a read semaphore.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llc_conf_mutex was used to protect links and link-related configurations
in the same link group, for example, adding or deleting links. However,
in most cases, the protected critical area has only read semantics and
no write semantics at all, such as obtaining a usable link or an
available rmb_desc.
This patch does simple code refactoring: replace the mutex with an
rw_semaphore, mutex_lock() with down_write(), and mutex_unlock() with
up_write().
Theoretically, this replacement is equivalent, but after this patch,
we can distinguish lock granularity according to different semantics
of critical areas.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some applications seem to rely on RAW sockets.
If they use private netns, we can avoid piling all RAW
sockets bound to a given protocol into a single bucket.
Also place (struct raw_hashinfo).lock into its own
cache line to limit false sharing.
An alternative would be to have per-netns hash tables,
but this seems too expensive for most netns,
where RAW sockets are not used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use existing helpers and drop reason codes for RAW input path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev selftest callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As all users of the struct devlink_info_req are already in dev.c, move
this struct from devl_internal.c to be local in dev.c.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev flash callbacks, helpers and other related code from
leftover.c to dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev info callbacks, related drivers helpers functions and
other related code from leftover.c to dev.c. No functional change in
this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev eswitch callbacks and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev reload callback and related code from leftover.c to
file dev.c. No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move devlink dev get and dump callbacks and related dev code to new file
dev.c. This file shall include all callbacks that are specific to the
devlink dev object.
No functional change in this patch.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that commit 028fb19c6b ("netlink: provide an ability to set
default extack message") provides a weak function that doesn't override
an existing extack message provided by the driver, it makes sense to use
it also for LAG and HSR offloading, not just for bridge offloading.
Also consistently put the message string on a separate line, to reduce
line length from 92 to 84 characters.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230202140354.3158129-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the bvec_set_page helper to initialize bvecs.
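The conversion pattern is roughly (sketch):

-	bvec.bv_page = page;
-	bvec.bv_offset = offset;
-	bvec.bv_len = len;
+	bvec_set_page(&bvec, page, len, offset);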
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both the synchronous early drop algorithm and the asynchronous gc worker
completely ignore connections with the IPS_OFFLOAD_BIT status bit set. With
the new functionality that enabled UDP NEW connection offload in the CT
action, a malicious user can flood the conntrack table with offloaded UDP
connections by sending just a single packet per 5-tuple, because such
connections can no longer be deleted by the early drop algorithm.
To mitigate the issue, allow both early drop and gc to consider offloaded
UDP connections for deletion.
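A sketch of the relaxed skip condition used by early drop / gc (the helper
name is illustrative):

static bool ct_offload_protected(const struct nf_conn *ct)
{
	if (!test_bit(IPS_OFFLOAD_BIT, &ct->status))
		return false;
	/* offloaded UDP entries remain candidates for early drop / gc */
	return nf_ct_protonum(ct) != IPPROTO_UDP;
}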
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>