'ip6 dscp set $v' in an nftables outpute route chain has no effect.
While nftables does detect the dscp change and calls the reroute hook.
But ip6_route_me_harder never sets the dscp/flowlabel:
flowlabel/dsfield routing rules are ignored and no reroute takes place.
Thanks to Yi Chen for an excellent reproducer script that I used
to validate this change.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yi Chen <yiche@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Lion Ackermann reported that there is a race condition between namespace cleanup
in ipset and the garbage collection of the list:set type. The namespace
cleanup can destroy the list:set type of sets while the gc of the set type is
waiting to run in rcu cleanup. The latter uses data from the destroyed set which
thus leads use after free. The patch contains the following parts:
- When destroying all sets, first remove the garbage collectors, then wait
if needed and then destroy the sets.
- Fix the badly ordered "wait then remove gc" for the destroy a single set
case.
- Fix the missing rcu locking in the list:set type in the userspace test
case.
- Use proper RCU list handlings in the list:set type.
The patch depends on c1193d9bbb (netfilter: ipset: Add list flush to cancel_gc).
Fixes: 97f7cf1cd8 (netfilter: ipset: fix performance regression in swap operation)
Reported-by: Lion Ackermann <nnamrec@gmail.com>
Tested-by: Lion Ackermann <nnamrec@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check for mandatory netlink attributes in payload and meta expression
when used embedded from the inner expression, otherwise NULL pointer
dereference is possible from userspace.
Fixes: a150d122b6 ("netfilter: nft_meta: add inner match support")
Fixes: 3a07327d10 ("netfilter: nft_inner: support for inner tunnel header matching")
Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Due to timer wheel implementation, a timer will usually fire
after its schedule.
For instance, for HZ=1000, a timeout between 512ms and 4s
has a granularity of 64ms.
For this range of values, the extra delay could be up to 63ms.
For TCP, this means that tp->rcv_tstamp may be after
inet_csk(sk)->icsk_timeout whenever the timer interrupt
finally triggers, if one packet came during the extra delay.
We need to make sure tcp_rtx_probe0_timed_out() handles this case.
Fixes: e89688e3e9 ("net: tcp: fix unexcepted socket die when snd_wnd is 0")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Menglong Dong <imagedong@tencent.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240607125652.1472540-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The creation of new subflows can fail for different reasons. If no
subflow have been created using the received ADD_ADDR, the related
counters should not be updated, otherwise they will never be decremented
for events related to this ID later on.
For the moment, the number of accepted ADD_ADDR is only decremented upon
the reception of a related RM_ADDR, and only if the remote address ID is
currently being used by at least one subflow. In other words, if no
subflow can be created with the received address, the counter will not
be decremented. In this case, it is then important not to increment
pm.add_addr_accepted counter, and not to modify pm.accept_addr bit.
Note that this patch does not modify the behaviour in case of failures
later on, e.g. if the MP Join is dropped or rejected.
The "remove invalid addresses" MP Join subtest has been modified to
validate this case. The broadcast IP address is added before the "valid"
address that will be used to successfully create a subflow, and the
limit is decreased by one: without this patch, it was not possible to
create the last subflow, because:
- the broadcast address would have been accepted even if it was not
usable: the creation of a subflow to this address results in an error,
- the limit of 2 accepted ADD_ADDR would have then been reached.
Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Cc: stable@vger.kernel.org
Co-developed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: YonglongLi <liyonglong@chinatelecom.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-3-1ab9ddfa3d00@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RmAddr MIB counter is supposed to be incremented once when a valid
RM_ADDR has been received. Before this patch, it could have been
incremented as many times as the number of subflows connected to the
linked address ID, so it could have been 0, 1 or more than 1.
The "RmSubflow" is incremented after a local operation. In this case,
it is normal to tied it with the number of subflows that have been
actually removed.
The "remove invalid addresses" MP Join subtest has been modified to
validate this case. A broadcast IP address is now used instead: the
client will not be able to create a subflow to this address. The
consequence is that when receiving the RM_ADDR with the ID attached to
this broadcast IP address, no subflow linked to this ID will be found.
Fixes: 7a7e52e38a ("mptcp: add RM_ADDR related mibs")
Cc: stable@vger.kernel.org
Co-developed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: YonglongLi <liyonglong@chinatelecom.cn>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-2-1ab9ddfa3d00@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is strictly related to commit fb7a0d3348 ("mptcp: ensure snd_nxt
is properly initialized on connect"). It turns out that syzkaller can
trigger the retransmit after fallback and before processing any other
incoming packet - so that snd_una is still left uninitialized.
Address the issue explicitly initializing snd_una together with snd_nxt
and write_seq.
Suggested-by: Mat Martineau <martineau@kernel.org>
Fixes: 8fd738049a ("mptcp: fallback in case of simultaneous connect")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/485
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240607-upstream-net-20240607-misc-fixes-v1-1-1ab9ddfa3d00@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the noop_qdisc owner isn't initialized, then it will be 0,
so packets will erroneously be regarded as having been subject
to recursion as long as only CPU 0 queues them. For non-SMP,
that's all packets, of course. This causes a change in what's
reported to userspace, normally noop_qdisc would drop packets
silently, but with this change the syscall returns -ENOBUFS if
RECVERR is also set on the socket.
Fix this by initializing the owner field to -1, just like it
would be for dynamically allocated qdiscs by qdisc_alloc().
Fixes: 0f022d32c3 ("net/sched: Fix mirred deadlock on device recursion")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240607175340.786bfb938803.I493bf8422e36be4454c08880a8d3703cea8e421a@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The amp_id argument of l2cap_connect() was removed in
commit 84a4bb6548 ("Bluetooth: HCI: Remove HCI_AMP support")
It was always called with amp_id == 0, i.e. AMP_ID_BREDR == 0x00 (ie.
non-AMP controller). In the above commit, the code path for amp_id != 0
was preserved, although it should have used the amp_id == 0 one.
Restore the previous behavior of the non-AMP code path, to fix problems
with L2CAP connections.
Fixes: 84a4bb6548 ("Bluetooth: HCI: Remove HCI_AMP support")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This removes the bogus check for max > hcon->le_conn_max_interval since
the later is just the initial maximum conn interval not the maximum the
stack could support which is really 3200=4000ms.
In order to pass GAP/CONN/CPUP/BV-05-C one shall probably enter values
of the following fields in IXIT that would cause hci_check_conn_params
to fail:
TSPX_conn_update_int_min
TSPX_conn_update_int_max
TSPX_conn_update_peripheral_latency
TSPX_conn_update_supervision_timeout
Link: https://github.com/bluez/bluez/issues/847
Fixes: e4b019515f ("Bluetooth: Enforce validation on max value of connection interval")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When setting up an advertisement the code shall always attempt to use
the handle set by the instance since it may not be equal to the instance
ID.
Fixes: e77f43d531 ("Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
tcp_v6_syn_recv_sock() calls ip6_dst_store() before
inet_sk(newsk)->pinet6 has been set up.
This means ip6_dst_store() writes over the parent (listener)
np->dst_cookie.
This is racy because multiple threads could share the same
parent and their final np->dst_cookie could be wrong.
Move ip6_dst_store() call after inet_sk(newsk)->pinet6
has been changed and after the copy of parent ipv6_pinfo.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix an occasional memory overwrite caused by a fix added in 6.10
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmZjB6EACgkQM2qzM29m
f5cPXBAAstE7c7QAYCQtAwK/C/rsbr8xjAxcjivwAEOFl8znKc4ZTJ3u2tVSgAqI
DSbjXK/JVOW10ahDLawqpnWuqTUHtGlK4CMVRdZkmfEqokulhny+FmelJ4B0ZWMQ
fUSSdYO4ONuMRcytnFperqI0LL7XuEnSsKnE1cQ4HIybQchJyilj9c/TERIVGWzn
eHcVke07JF+lMvHWWJoDCcHhHHtl82znDPpmrsyzOLKS+jtiaP41r2ZPtohhoPDm
9VQmTqpSXWyexXpzgQzyu4TOWz6PPakoa701x7pW/2wZgzabdRaJhkYea64Bc8GK
RTfr7TpSKkcnttQSb2jbb2+qG15g+f7nDbWtejqWuhEppwdj/EznuyrHi80N5Klf
pw4+XFU8HoC15jcdltMb2Jecbv6uulQ0AcfCOLUh61E86tiy7YDChmhRLgDhzRsS
jDvq8HxRe1CNmqoxfqIuWuWiMwyCRSSd06U7u3/eqBZeLXOxQb26bxG8GXitUPgY
G4SCCSvExs2YqnegsZNQGB0L5YKd/zhhnkoc/Y29CU9ZhoItgHlblKODsix9kVJl
ckN/qSZS5/rrV7IsvzsB0ywNb5VkfnkspHu7X8tlYBuwgnTHJCp25AuWVt31/6jH
E+tW4wdvd1oBTLqpWRewclTwUrmvJYVYirS01LiTVehx/mnJdw8=
=NvAo
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix an occasional memory overwrite caused by a fix added in 6.10
* tag 'nfsd-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Fix loop termination condition in gss_free_in_token_pages()
While dumping sockets via UNIX_DIAG, we do not hold unix_state_lock().
Let's use READ_ONCE() to read sk->sk_shutdown.
Fixes: e4e541a848 ("sock-diag: Report shutdown for inet and unix sockets (v2)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We can dump the socket queue length via UNIX_DIAG by specifying
UDIAG_SHOW_RQLEN.
If sk->sk_state is TCP_LISTEN, we return the recv queue length,
but here we do not hold recvq lock.
Let's use skb_queue_len_lockless() in sk_diag_show_rqlen().
Fixes: c9da99e647 ("unix_diag: Fixup RQLEN extension report")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().
However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().
Thues, unix_state_lock() does not protect qlen.
Let's use skb_queue_empty_lockless() in unix_release_sock().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.
It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().
Thus, we need to use unix_recvq_full_lockless() to avoid data-race.
Now we remove unix_recvq_full() as no one uses it.
Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:
(1) For SOCK_DGRAM, it is a written-once field in unix_create1()
(2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
listener's unix_state_lock() in unix_listen(), and we hold
the lock in unix_stream_connect()
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net->unx.sysctl_max_dgram_qlen is exposed as a sysctl knob and can be
changed concurrently.
Let's use READ_ONCE() in unix_create1().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk_setsockopt() changes sk->sk_sndbuf under lock_sock(), but it's
not used in af_unix.c.
Let's use READ_ONCE() to read sk->sk_sndbuf in unix_writable(),
unix_dgram_sendmsg(), and unix_stream_sendmsg().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
While dumping AF_UNIX sockets via UNIX_DIAG, sk->sk_state is read
locklessly.
Let's use READ_ONCE() there.
Note that the result could be inconsistent if the socket is dumped
during the state change. This is common for other SOCK_DIAG and
similar interfaces.
Fixes: c9da99e647 ("unix_diag: Fixup RQLEN extension report")
Fixes: 2aac7a2cb0 ("unix_diag: Pending connections IDs NLA")
Fixes: 45a96b9be6 ("unix_diag: Dumping all sockets core")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
unix_stream_read_skb() is called from sk->sk_data_ready() context
where unix_state_lock() is not held.
Let's use READ_ONCE() there.
Fixes: 77462de14a ("af_unix: Add read_sock for stream socket types")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The following functions read sk->sk_state locklessly and proceed only if
the state is TCP_ESTABLISHED.
* unix_stream_sendmsg
* unix_stream_read_generic
* unix_seqpacket_sendmsg
* unix_seqpacket_recvmsg
Let's use READ_ONCE() there.
Fixes: a05d2ad1c1 ("af_unix: Only allow recv on connected seqpacket sockets.")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes the advantage and reads sk->sk_state without
holding unix_state_lock().
Let's use READ_ONCE() there.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As small optimisation, unix_stream_connect() prefetches the client's
sk->sk_state without unix_state_lock() and checks if it's TCP_CLOSE.
Later, sk->sk_state is checked again under unix_state_lock().
Let's use READ_ONCE() for the first check and TCP_CLOSE directly for
the second check.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and
calls unix_writable() which also reads sk->sk_state without holding
unix_state_lock().
Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass
it to unix_writable().
While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as
that state does not exist for AF_UNIX socket since the code was added.
Fixes: 1586a5877d ("af_unix: do not report POLLOUT on listeners")
Fixes: 3c73419c09 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ioctl(SIOCINQ) calls unix_inq_len() that checks sk->sk_state first
and returns -EINVAL if it's TCP_LISTEN.
Then, for SOCK_STREAM sockets, unix_inq_len() returns the number of
bytes in recvq.
However, unix_inq_len() does not hold unix_state_lock(), and the
concurrent listen() might change the state after checking sk->sk_state.
If the race occurs, 0 is returned for the listener, instead of -EINVAL,
because the length of skb with embryo is 0.
We could hold unix_state_lock() in unix_inq_len(), but it's overkill
given the result is true for pre-listen() TCP_CLOSE state.
So, let's use READ_ONCE() for sk->sk_state in unix_inq_len().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk->sk_state is changed under unix_state_lock(), but it's read locklessly
in many places.
This patch adds WRITE_ONCE() on the writer side.
We will add READ_ONCE() to the lockless readers in the following patches.
Fixes: 83301b5367 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.
When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.
Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.
Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.
After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:
$ ss -xa
Netid State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
u_dgr ESTAB 0 0 @A 641 * 642
u_dgr ESTAB 0 0 @B 642 * 643
u_dgr ESTAB 0 0 @C 643 * 0
And after the disconnect, B's state is TCP_CLOSE even though it's still
connected to C and C's state is TCP_ESTABLISHED.
$ ss -xa
Netid State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
u_dgr UNCONN 0 0 @A 641 * 0
u_dgr UNCONN 0 0 @B 642 * 643
u_dgr ESTAB 0 0 @C 643 * 0
In this case, we cannot register B to SOCKMAP.
So, when a socket disconnects from the peer, we should not set TCP_CLOSE to
the peer if the peer is connected to yet another socket, and this must be
done under unix_state_lock().
Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers. These data-races will be fixed in the following patches.
Fixes: 83301b5367 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZmAYPgAKCRDbK58LschI
g2XdAP9M8zYLRw4IG8DUFug7F+oqRPqgbs+Gvsf9YNl5/PSiTQEA6WKa/ObaG/W9
vre9VxhMWKgcMfzqZyztNHAiDm8R+QI=
=l7gV
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-06-05
We've added 8 non-merge commits during the last 6 day(s) which contain
a total of 9 files changed, 34 insertions(+), 35 deletions(-).
The main changes are:
1) Fix a potential use-after-free in bpf_link_free when the link uses
dealloc_deferred to free the link object but later still tests for
presence of link->ops->dealloc, from Cong Wang.
2) Fix BPF test infra to set the run context for rawtp test_run callback
where syzbot reported a crash, from Jiri Olsa.
3) Fix bpf_session_cookie BTF_ID in the special_kfunc_set list to exclude
it for the case of !CONFIG_FPROBE, also from Jiri Olsa.
4) Fix a Coverity static analysis report to not close() a link_fd of -1
in the multi-uprobe feature detector, from Andrii Nakryiko.
5) Revert support for redirect to any xsk socket bound to the same umem
as it can result in corrupted ring state which can lead to a crash when
flushing rings. A different approach will be pursued for bpf-next to
address it safely, from Magnus Karlsson.
6) Fix inet_csk_accept prototype in test_sk_storage_tracing.c which caused
BPF CI failure after the last tree fast forwarding, from Andrii Nakryiko.
7) Fix a coccicheck warning in BPF devmap that iterator variable cannot
be NULL, from Thorsten Blum.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
Revert "xsk: Document ability to redirect to any socket bound to the same umem"
Revert "xsk: Support redirect to any socket bound to the same umem"
bpf: Set run context for rawtp test_run callback
bpf: Fix a potential use-after-free in bpf_link_free()
bpf, devmap: Remove unnecessary if check in for loop
libbpf: don't close(-1) in multi-uprobe feature detector
bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list
selftests/bpf: fix inet_csk_accept prototype in test_sk_storage_tracing.c
====================
Link: https://lore.kernel.org/r/20240605091525.22628-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided,
taprio_parse_mqprio_opt() must validate it, or userspace
can inject arbitrary data to the kernel, the second time
taprio_change() is called.
First call (with valid attributes) sets dev->num_tc
to a non zero value.
Second call (with arbitrary mqprio attributes)
returns early from taprio_parse_mqprio_opt()
and bad things can happen.
Fixes: a3d43c0d56 ("taprio: Add support adding an admin schedule")
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240604181511.769870-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo()
and inet_dump_ifaddr(). We already added a similar fix previously in
commit 460b0d33cf ("inet: bring NLM_DONE out to a separate recv() again")
Instead of modifying all the dump handlers, and making them look
different than modern for_each_netdev_dump()-based dump handlers -
put the workaround in rtnetlink code. This will also help us move
the custom rtnl-locking from af_netlink in the future (in net-next).
Note that this change is not touching rtnl_dump_all(). rtnl_dump_all()
is different kettle of fish and a potential problem. We now mix families
in a single recvmsg(), but NLM_DONE is not coalesced.
Tested:
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
--dump getaddr --json '{"ifa-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
--dump getroute --json '{"rtm-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \
--dump getlink
Fixes: 3e41af9076 ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()")
Fixes: cdb2f80f1c ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like previous patch does in TCP, we need to adhere to RFC 1213:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
So let's consider CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) Only if we change the state to ESTABLISHED.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Fixes: d9cd27b8cd ("mptcp: add CurrEstab MIB counter support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to RFC 1213, we should also take CLOSE-WAIT sockets into
consideration:
"tcpCurrEstab OBJECT-TYPE
...
The number of TCP connections for which the current state
is either ESTABLISHED or CLOSE- WAIT."
After this, CurrEstab counter will display the total number of
ESTABLISHED and CLOSE-WAIT sockets.
The logic of counting
When we increment the counter?
a) if we change the state to ESTABLISHED.
b) if we change the state from SYN-RECEIVED to CLOSE-WAIT.
When we decrement the counter?
a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT,
say, on the client side, changing from ESTABLISHED to FIN-WAIT-1.
b) if the socket leaves CLOSE-WAIT, say, on the server side, changing
from CLOSE-WAIT to LAST-ACK.
Please note: there are two chances that old state of socket can be changed
to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED.
So we have to take care of the former case.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
q->bands will be assigned to qopt->bands to execute subsequent code logic
after kmalloc. So the old q->bands should not be used in kmalloc.
Otherwise, an out-of-bounds write will occur.
Fixes: c2999f7fb0 ("net: sched: multiq: don't call qdisc_put() while holding tree lock")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When copying smc settings to clcsock, avoid setting clcsock's sk_sndbuf
to sysctl_tcp_wmem[1], since this may overwrite the value set by
tcp_sndbuf_expand() in TCP connection establishment.
And the other setting sk_{snd|rcv}buf to sysctl value in
smc_adjust_sock_bufsizes() can also be omitted since the initialization
of smc sock and clcsock has set sk_{snd|rcv}buf to smc.sysctl_{w|r}mem
or ipv4_sysctl_tcp_{w|r}mem[1].
Fixes: 30c3c4a449 ("net/smc: Use correct buffer sizes when switching between TCP and SMC")
Link: https://lore.kernel.org/r/5eaf3858-e7fd-4db8-83e8-3d7a3e0e9ae2@linux.alibaba.com
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>, too.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 2863d665ea.
This patch introduced a potential kernel crash when multiple napi instances
redirect to the same AF_XDP socket. By removing the queue_index check, it is
possible for multiple napi instances to access the Rx ring at the same time,
which will result in a corrupted ring state which can lead to a crash when
flushing the rings in __xsk_flush(). This can happen when the linked list of
sockets to flush gets corrupted by concurrent accesses. A quick and small fix
is not possible, so let us revert this for now.
Reported-by: Yuval El-Hanany <YuvalE@radware.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/xdp-newbies/8100DBDC-0B7C-49DB-9995-6027F6E63147@radware.com
Link: https://lore.kernel.org/bpf/20240604122927.29080-2-magnus.karlsson@gmail.com
For TLS offload we mark packets with skb->decrypted to make sure
they don't escape the host without getting encrypted first.
The crypto state lives in the socket, so it may get detached
by a call to skb_orphan(). As a safety check - the egress path
drops all packets with skb->decrypted and no "crypto-safe" socket.
The skb marking was added to sendpage only (and not sendmsg),
because tls_device injected data into the TCP stack using sendpage.
This special case was missed when sendpage got folded into sendmsg.
Fixes: c5c37af6ec ("tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240530232607.82686-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The first fixes for v6.10. And we have a big one, I suspect the
biggest wireless pull request we ever had. There are fixes all over,
both in stack and drivers. Likely the most important here are mt76 not
working on mt7615 devices, ath11k not being able to connect to 6 GHz
networks and rtlwifi suffering from packet loss. But of course there's
much more.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmZdrSMRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZuYnAgAqZMvSQKbhkYRfIua9Ygmdk8pDetEhaJg
HAUiW8ymLkWG1Md1V3tjY9Es66YeSr03Tx8xz/8pDWSaiUCdCs+u0zonhRK0sb3/
HAfyUpRIGDq/9kTM9lfL+yOTKwtRti+NmTXqeJr1CrpBejuXydRT+YYcIEmwQQxw
pm5xwLmrq74pMRE3VCTRbj2mfv/leFoDQasTeimUi6PgRrCeXSmUDy14k1zyAsI5
/uchvyvhgN54U2vIvO0RzW5zfS84cNEG+mW/+PpiuMftH2sYS1/UqT8BCQdrfzBz
PuiocUBXU68NzB4iLZhzPSDirOsYVaYpgKPfLPF1jNoj+iEQh1HhCA==
=5my7
-----END PGP SIGNATURE-----
Merge tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.10-rc3
The first fixes for v6.10. And we have a big one, I suspect the
biggest wireless pull request we ever had. There are fixes all over,
both in stack and drivers. Likely the most important here are mt76 not
working on mt7615 devices, ath11k not being able to connect to 6 GHz
networks and rtlwifi suffering from packet loss. But of course there's
much more.
* tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (37 commits)
wifi: rtlwifi: Ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
wifi: mt76: mt7615: add missing chanctx ops
wifi: wilc1000: document SRCU usage instead of SRCU
Revert "wifi: wilc1000: set atomic flag on kmemdup in srcu critical section"
Revert "wifi: wilc1000: convert list management to RCU"
wifi: mac80211: fix UBSAN noise in ieee80211_prep_hw_scan()
wifi: mac80211: correctly parse Spatial Reuse Parameter Set element
wifi: mac80211: fix Spatial Reuse element size check
wifi: iwlwifi: mvm: don't read past the mfuart notifcation
wifi: iwlwifi: mvm: Fix scan abort handling with HW rfkill
wifi: iwlwifi: mvm: check n_ssids before accessing the ssids
wifi: iwlwifi: mvm: properly set 6 GHz channel direct probe option
wifi: iwlwifi: mvm: handle BA session teardown in RF-kill
wifi: iwlwifi: mvm: Handle BIGTK cipher in kek_kck cmd
wifi: iwlwifi: mvm: remove stale STA link data during restart
wifi: iwlwifi: dbg_ini: move iwl_dbg_tlv_free outside of debugfs ifdef
wifi: iwlwifi: mvm: set properly mac header
wifi: iwlwifi: mvm: revert gen2 TX A-MPDU size to 64
wifi: iwlwifi: mvm: d3: fix WoWLAN command version lookup
wifi: iwlwifi: mvm: fix a crash on 7265
...
====================
Link: https://lore.kernel.org/r/20240603115129.9494CC2BD10@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After fixing four different bugs involving dst_cache
users, it might be worth adding a check about BH being
blocked by dst_cache callers.
DEBUG_NET_WARN_ON_ONCE(!in_softirq());
It is not fatal, if we missed valid case where no
BH deadlock is to be feared, we might change this.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained in commit 1378817486 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
ila_output() is called from lwtunnel_output()
possibly from process context, and under rcu_read_lock().
We might be interrupted by a softirq, re-enter ila_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained in commit 1378817486 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in seg6_output_core() is not good enough,
because seg6_output_core() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter seg6_output_core()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Apply a similar change in seg6_input_core().
Fixes: fa79581ea6 ("ipv6: sr: fix several BUGs when preemption is enabled")
Fixes: 6c8702c60b ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Lebrun <dlebrun@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained in commit 1378817486 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in rpl_output() is not good enough,
because rpl_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter rpl_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Apply a similar change in rpl_input().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Aring <aahringo@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As explained in commit 1378817486 ("tipc: block BH
before using dst_cache"), net/core/dst_cache.c
helpers need to be called with BH disabled.
Disabling preemption in ioam6_output() is not good enough,
because ioam6_output() is called from process context,
lwtunnel_output() only uses rcu_read_lock().
We might be interrupted by a softirq, re-enter ioam6_output()
and corrupt dst_cache data structures.
Fix the race by using local_bh_disable() instead of
preempt_disable().
Fixes: 8cb3bf8bff ("ipv6: ioam: Add support for the ip6ip6 encapsulation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Justin Iurman <justin.iurman@uliege.be>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240531132636.2637995-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The in_token->pages[] array is not NULL terminated. This results in
the following KASAN splat:
KASAN: maybe wild-memory-access in range [0x04a2013400000008-0x04a201340000000f]
Fixes: bafa6b4d95 ("SUNRPC: Fix gss_free_in_token_pages()")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
mac802154 devices update their dev->stats fields locklessly. Therefore
these counters should be updated atomically. Adopt SMP safe DEV_STATS_INC()
and DEV_STATS_ADD() to achieve this.
Signed-off-by: Yunshui Jiang <jiangyunshui@kylinos.cn>
Message-ID: <20240531080739.2608969-1-jiangyunshui@kylinos.cn>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
The internal handling of VLAN IDs in batman-adv is only specified for
following encodings:
* VLAN is used
- bit 15 is 1
- bit 11 - bit 0 is the VLAN ID (0-4095)
- remaining bits are 0
* No VLAN is used
- bit 15 is 0
- remaining bits are 0
batman-adv was only preparing new translation table entries (based on its
soft interface information) using this encoding format. But the receive
path was never checking if entries in the roam or TT TVLVs were also
following this encoding.
It was therefore possible to create more than the expected maximum of 4096
+ 1 entries in the originator VLAN list. Simply by setting the "remaining
bits" to "random" values in corresponding TVLV.
Cc: stable@vger.kernel.org
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>