After 614732eaa1, no refcount is maintained for the vport-vxlan module.
This allows the userspace to remove such module while vport-vxlan
devices still exist, which leads to later oops.
v1 -> v2:
- move vport 'owner' initialization in ovs_vport_ops_register()
and make such function a macro
Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
triggering a fault in sock_wake_async() when async IO is requested.
Said program stressed af_unix sockets, but the issue is generic
and should be addressed in core networking stack.
The problem is that by the time sock_wake_async() is called,
we should not access the @flags field of 'struct socket',
as the inode containing this socket might be freed without
further notice, and without RCU grace period.
We already maintain an RCU protected structure, "struct socket_wq"
so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
is the safe route.
It also reduces number of cache lines needing dirtying, so might
provide a performance improvement anyway.
In followup patches, we might move remaining flags (SOCK_NOSPACE,
SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
being mostly read and let it being shared between cpus.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup to make following patch easier to
review.
Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()
To ease backports, we rename both constants.
Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit ab450605b3.
In IPv6, we cannot inherit the dst of the original dst. ndisc packets
are IPv6 packets and may take another route than the original packet.
This patch breaks the following scenario: a packet comes from eth0 and
is forwarded through vxlan1. The encapsulated packet triggers an NS
which cannot be sent because of the wrong route.
CC: Jiri Benc <jbenc@redhat.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current unix_dgram_recvmsg does a wake up for every received
datagram. This seems wasteful as only SOCK_DGRAM client sockets in an
n:1 association with a server socket will ever wait because of the
associated condition. The patch below changes the function such that the
wake up only happens if wq_has_sleeper indicates that someone actually
wants to be notified. Testing with SOCK_SEQPACKET and SOCK_DGRAM socket
seems to confirm that this is an improvment.
Signed-Off-By: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
generated program that triggers the WARNING at
net/ipv4/tcp.c:1729 in tcp_recvmsg() :
WARN_ON(tp->copied_seq != tp->rcv_nxt &&
!(flags & (MSG_PEEK | MSG_TRUNC)));
His program is specifically attempting a Cross SYN TCP exchange,
that we support (for the pleasure of hackers ?), but it looks we
lack proper tcp->copied_seq initialization.
Thanks again Dmitry for your report and testings.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to add and remove MFC entries. It uses the
same attributes like the already present dump support in order to be
consistent. There's one new entry - RTA_PREFSRC, it's used to denote an
MFC_PROXY entry (see MRT_ADD_MFC vs MRT_ADD_MFC_PROXY).
The already existing infrastructure is used to create and delete the
entries, the netlink message gets converted internally to a struct mfcctl
which is used with ipmr_mfc_add/delete.
The other used attributes are:
RTA_IIF - used for mfcc_parent (when adding it's required to be valid)
RTA_SRC - used for mfcc_origin
RTA_DST - used for mfcc_mcastgrp
RTA_TABLE - the MRT table id
RTA_MULTIPATH - the "oifs" ttl array
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can have both errors and we'll return the second one, fix it to
return an error at a time as it's normal. I've overlooked this in my
previous set.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the inline pimsm_enabled() to pim.h and rename it to
ipmr_pimsm_enabled to show it's for the ipv4 ipmr code since pim.h is
used by IPv6 too.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the definitions of VIF_EXISTS() and struct mr_table to mroute.h
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MFC_NOTIFY was introduced in kernel 2.1.68 but afaik it hasn't been used
and I couldn't find any users currently so just remove it. Only
MFC_STATIC is left, so move it into an enum, add a description and use
BIT().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like many files are including mroute.h unnecessarily, so remove
the include. Most importantly remove it from ipv6.
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sendpage did not care about credentials at all. This could lead to
situations in which because of fd passing between processes we could
append data to skbs with different scm data. It is illegal to splice those
skbs together. Instead we have to allocate a new skb and if requested
fill out the scm details.
Fixes: 869e7c6248 ("net: af_unix: implement stream sendpage support")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory barrier in the helper wq_has_sleeper is needed by just
about every user of waitqueue_active. This patch generalises it
by making it take a wait_queue_head_t directly. The existing
helper is renamed to skwq_has_sleeper.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9c7077622d ("packet: make packet_snd fail on len smaller
than l2 header") added validation for the packet size in packet_snd.
This change enforces that every packet needs a header (with at least
hard_header_len bytes) plus a payload with at least one byte. Before
this change the payload was optional.
This fixes PPPoE connections which do not have a "Service" or
"Host-Uniq" configured (which is violating the spec, but is still
widely used in real-world setups). Those are currently failing with the
following message: "pppd: packet size is too short (24 <= 24)"
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for mangling packet payload. Checksum for the specified base
header is updated automatically if requested, however no updates for any
kind of pseudo headers are supported, meaning no stateless NAT is supported.
For checksum updates different checksumming methods can be specified. The
currently supported methods are NONE for no checksum updates, and INET for
internet type checksums.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Various files are owned by root with 0440 permission. Reading them is
impossible in an unprivileged user namespace, interfering with firewall
tools. For instance, iptables-save relies on /proc/net/ip_tables_names
contents to dump only loaded tables.
This patch assigned ownership of the following files to root in the
current namespace:
- /proc/net/*_tables_names
- /proc/net/*_tables_matches
- /proc/net/*_tables_targets
- /proc/net/nf_conntrack
- /proc/net/nf_conntrack_expect
- /proc/net/netfilter/nfnetlink_log
A mapping for root must be available, so this order should be followed:
unshare(CLONE_NEWUSER);
/* Setup the mapping */
unshare(CLONE_NEWNET);
Signed-off-by: Philip Whineray <phil@firehol.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sasha's found a NULL pointer dereference in the RDS connection code when
sending a message to an apparently unbound socket. The problem is caused
by the code checking if the socket is bound in rds_sendmsg(), which checks
the rs_bound_addr field without taking a lock on the socket. This opens a
race where rs_bound_addr is temporarily set but where the transport is not
in rds_bind(), leading to a NULL pointer dereference when trying to
dereference 'trans' in __rds_conn_create().
Vegard wrote a reproducer for this issue, so kindly ask him to share if
you're interested.
I cannot reproduce the NULL pointer dereference using Vegard's reproducer
with this patch, whereas I could without.
Complete earlier incomplete fix to CVE-2015-6937:
74e98eb085 ("RDS: verify the underlying transport exists before creating a connection")
Cc: David S. Miller <davem@davemloft.net>
Cc: stable@vger.kernel.org
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During pre-upstream development, the openvswitch datapath used a custom
hashtable to store vports that could fail on delete due to lack of
memory. However, prior to upstream submission, this code was reworked to
use an hlist based hastable with flexible-array based buckets. As such
the failure condition was eliminated from the vport_del path, rendering
this comment invalid.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since (at least) commit b17a7c179d ("[NET]: Do sysfs registration as
part of register_netdevice."), netdev_run_todo() deals only with
unregistration, so we don't need to do the rtnl_unlock/lock cycle to
finish registration when failing pimreg or dvmrp device creation. In
fact that opens a race condition where someone can delete the device
while rtnl is unlocked because it's fully registered. The problem gets
worse when netlink support is introduced as there are more points of entry
that can cause it and it also makes reusing that code correctly impossible.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally, the transmit phase of a client call is implicitly ack'd by the
reception of the first data packet of the response being received.
However, if a security negotiation happens, the transmit phase, if it is
entirely contained in a single packet, may get an ack packet in response
and then may get aborted due to security negotiation failure.
Because the client has shifted state to RXRPC_CALL_CLIENT_AWAIT_REPLY due
to having transmitted all the data, the code that handles processing of the
received ack packet doesn't note the hard ack the data packet.
The following abort packet in the case of security negotiation failure then
incurs an assertion failure when it tries to drain the Tx queue because the
hard ack state is out of sync (hard ack means the packets have been
processed and can be discarded by the sender; a soft ack means that the
packets are received but could still be discarded and rerequested by the
receiver).
To fix this, we should record the hard ack we received for the ack packet.
The assertion failure looks like:
RxRPC: Assertion failed
1 <= 0 is false
0x1 <= 0x0 is false
------------[ cut here ]------------
kernel BUG at ../net/rxrpc/ar-ack.c:431!
...
RIP: 0010:[<ffffffffa006857b>] [<ffffffffa006857b>] rxrpc_rotate_tx_window+0xbc/0x131 [af_rxrpc]
...
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.
To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.
Note: similar patch has been already submitted by Yoshifuji Hideaki in
http://patchwork.ozlabs.org/patch/220979/
but got lost and forgotten for some reason.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-11-23
Here's the first bluetooth-next pull request for the 4.5 kernel.
- Add new Get Advertising Size Information management command
- Add support for new system note message type on monitor channel
- Refactor LE scan changes behind separate workqueue to avoid races
- Fix issue with privacy feature when powering on adapter
- Various minor fixes & cleanups here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The principal name on a gss cred is used to setup the NFSv4.0 callback,
which has to have a client principal name to authenticate to.
That code wants the name to be in the form servicetype@hostname.
rpc.svcgssd passes down such names (and passes down no principal name at
all in the case the principal isn't a service principal).
gss-proxy always passes down the principal name, and passes it down in
the form servicetype/hostname@REALM. So we've been munging the name
gss-proxy passes down into the format the NFSv4.0 callback code expects,
or throwing away the name if we can't.
Since the introduction of the MACH_CRED enforcement in NFSv4.1, we've
also been using the principal name to verify that certain operations are
done as the same principal as was used on the original EXCHANGE_ID call.
For that application, the original name passed down by gss-proxy is also
useful.
Lack of that name in some cases was causing some kerberized NFSv4.1
mount failures in an Active Directory environment.
This fix only works in the gss-proxy case. The fix for legacy
rpc.svcgssd would be more involved, and rpc.svcgssd already has other
problems in the AD case.
Reported-and-tested-by: James Ralston <ralston@pobox.com>
Acked-by: Simo Sorce <simo@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 09605cc12c ("net ipv4: use preferred log methods") replaced
a few calls of pr_cont() after a console print without a trailing
newline by pr_info(), causing lines to be split during IP
autoconfiguration, like:
.
,
OK
IP-Config: Got DHCP answer from 192.168.97.254,
my address is 192.168.97.44
Convert these back to using pr_cont(), so it prints again:
., OK
IP-Config: Got DHCP answer from 192.168.97.254, my address is 192.168.97.44
Absorb the printing of "my address ..." into the previous call to
pr_info(), as there's no reason to use a continuation there.
Convert one more pr_info() to print nameservers while we're at it.
Fixes: 09605cc12c ("net ipv4: use preferred log methods")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the introduction of the switch gpio reset API, I'm getting
build errors in configurations that disable CONFIG_GPIOLIB:
net/dsa/dsa.c:783:16: error: implicit declaration of function 'gpio_to_desc' [-Werror=implicit-function-declaration]
The reason is that linux/gpio/consumer.h is not automatically
included without gpiolib support. This adds an explicit #include
statement to make it compile in all configurations. The reset
functionality will not work without gpiolib, which is what you
get when disabling the feature.
As far as I can tell, gpiolib is supported on all architectures
on which you can have DSA at the moment.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: cc30c16344 ("net: dsa: Add support for a switch reset gpio")
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Coverity says:
*** CID 1338065: Error handling issues (CHECKED_RETURN)
/net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
156 struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
157 struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
158 struct sk_buff *clone;
159 struct rtable *rt;
160
161 if (skb_headroom(skb) < UDP_MIN_HEADROOM)
>>> CID 1338065: Error handling issues (CHECKED_RETURN)
>>> Calling "pskb_expand_head" without checking return value (as is done elsewhere 51 out of 56 times).
162 pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
163
164 clone = skb_clone(skb, GFP_ATOMIC);
165 skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
166 ub = rcu_dereference_rtnl(b->media_ptr);
167 if (!ub) {
When expanding buffer headroom over udp tunnel with pskb_expand_head(),
it's unfortunate that we don't check its return value. As a result, if
the function returns an error code due to the lack of memory, it may
cause unpredictable consequence as we unconditionally consider that
it's always successful.
Fixes: e53567948f ("tipc: conditionally expand buffer headroom over udp tunnel")
Reported-by: <scan-admin@coverity.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if we drain receive queue thoroughly in tipc_release() after tipc
socket is removed from rhashtable, it is possible that some packets
are in flight because some CPU runs receiver and did rhashtable lookup
before we removed socket. They will achieve receive queue, but nobody
delete them at all. To avoid this leak, we register a private socket
destructor to purge receive queue, meaning releasing packets pending
on receive queue will be delayed until the last reference of tipc
socket will be released.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A truncated cb_compound request will cause the client to decode null or
data from a previous callback for nfs4.1 backchannel case, or uninitialized
data for the nfs4.0 case. This is because the path through
svc_process_common() advances the request's iov_base and decrements iov_len
without adjusting the overall xdr_buf's len field. That causes
xdr_init_decode() to set up the xdr_stream with an incorrect length in
nfs4_callback_compound().
Fixing this for the nfs4.1 backchannel case first requires setting the
correct iov_len and page_len based on the length of received data in the
same manner as the nfs4.0 case.
Then the request's xdr_buf length can be adjusted for both cases based upon
the remaining iov_len and page_len.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The vmci_transport_notify_ops structures are never modified, so declare
them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The in_cache_ops and eg_cache_ops structures are never modified, so declare
them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out common vif init code used in both tunnel and pimreg
initialization and create ipmr_init_vif_indev() function.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take rtnl in the beginning unconditionally as most options already need
it (one exception - MRT_DONE, see the comment inside), make the
lock/unlock places central and move out the switch() local variables.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not necessary to check for null as SLAB_PANIC is used and we'll
panic if the alloc fails, so just drop it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial replace of ifdef with IS_BUILTIN().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a switch to determine if optname is correct and set val accordingly.
This produces a much more straight-forward and readable code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial code and comment style fixes, also removed some extra newlines,
spaces and tabs.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM
define and is used to check if any version of PIM-SM has been enabled.
Use a single if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
for the pim-sm shared code. This is okay w.r.t IGMPMSG_WHOLEPKT because
only a VIFF_REGISTER device can send such packet, and it can't be
created if pimsm_enabled() is false.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before mroute_reg_vif_num was defined only if any of the CONFIG_PIMSM_
options were set, but that's not really necessary as the size of the
struct is the same in both cases (checked with pahole, both cases size
is 3256 bytes) and we can remove some unnecessary ifdefs to simplify the
code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the table id check in ipmr_new_table and make it return error
pointer. We need this change for the upcoming netlink table manipulation
support in order to avoid code duplication and a race condition.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ON_ONCE() takes a condition, it doesn't take an error message. I
have converted this to WARN() instead.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.
The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.
The comments and the variable naming refer to "chunks" but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.
The fix assumes this as well. When the XDR pad in the first write
chunk is reached, the assumption is the Write list is complete and
send_write_chunks() returns.
That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Fixes: ec0d215f94 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The classid of a process is changed either when a process is moved to
or from a cgroup or when the net_cls.classid file is updated.
Previously net_cls only supported propogating these changes to the
cgroup's related sockets when a process was added or removed from the
cgroup. This means it was neccessary to remove and re-add all processes
to a cgroup in order to update its classid. This change introduces
support for doing this dynamically - i.e. when the value is changed in
the net_cls_classid file, this will also trigger an update to the
classid associated with all sockets controlled by the cgroup.
This mimics the behaviour of other cgroup subsystems.
net_prio circumvents this issue by storing an index into a table with
each socket (and so any updates to the table, don't require updating
the value associated with the socket). net_cls, however, passes the
socket the classid directly, and so this additional step is needed.
Signed-off-by: Nina Schiff <ninasc@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch changed nf_ct_frag6_gather() to morph reassembled skb
with the previous one.
This means that the return value is always NULL or the skb argument.
So change it to an err value.
Instead of invoking NF_HOOK recursively with threshold to skip already-called hooks
we can now just return NF_ACCEPT to move on to the next hook except for
-EINPROGRESS (which means skb has been queued for reassembly), in which case we
return NF_STOLEN.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 6aafeef03b
("netfilter: push reasm skb through instead of original frag skbs")
changed ipv6 defrag to not use the original skbs anymore.
So rather than keeping the original skbs around just to discard them
afterwards just use the original skbs directly for the fraglist of
the newly assembled skb and remove the extra clone/free operations.
The skb that completes the fragment queue is morphed into a the
reassembled one instead, just like ipv4 defrag.
openvswitch doesn't need any additional skb_morph magic anymore to deal
with this situation so just remove that.
A followup patch can then also remove the NF_HOOK (re)invocation in
the ipv6 netfilter defrag hook.
Cc: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Eliminate some checkpatch issues by improved layout of if statements.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change brace placement to eliminate checkpatch error.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>