1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

2399 commits

Author SHA1 Message Date
Denis Drozdov
f6a8a19bb1 RDMA/netdev: Hoist alloc_netdev_mqs out of the driver
netdev has several interfaces that expect to call alloc_netdev_mqs from
the core code, with the driver only providing the arguments.  This is
incompatible with the rdma_netdev interface that returns the netdev
directly.

Thus re-organize the API used by ipoib so that the verbs core code calls
alloc_netdev_mqs for the driver. This is done by allowing the drivers to
provide the allocation parameters via a 'get_params' callback and then
initializing an allocated netdev as a second step.

Fixes: cd565b4b51 ("IB/IPoIB: Support acceleration options callbacks")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Denis Drozdov <denisd@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 17:58:11 -07:00
Leon Romanovsky
ed7a01fd3f RDMA/restrack: Release task struct which was hold by CM_ID object
Tracking CM_ID resource is performed in two stages: creation of cm_id
and connecting it to the cma_dev. It is needed because rdma-cm protocol
exports two separate user-visible calls rdma_create_id and rdma_accept.

At the time of CM_ID creation, the real owner of that object is unknown
yet and we need to grab task_struct. This task_struct is released or
reassigned in attach phase later on. but call to rdma_destroy_id left
this task_struct unreleased.

Such separation is unique to CM_ID and other restrack objects initialize
in one shot. It means that it is safe to use "res->valid" check to catch
unfinished CM_ID flow and release task_struct for that object.

Fixes: 00313983cd ("RDMA/nldev: provide detailed CM_ID information")
Reported-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-05 16:07:39 -06:00
Leon Romanovsky
2165fc2640 RDMA/restrack: Consolidate task name updates in one place
Unify task update and kernel name set in one place.

Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-05 16:07:39 -06:00
Leon Romanovsky
363ad35577 RDMA/restrack: Un-inline set task implementation
Prepare rdma_restrack_set_task() call to accommodate more
code by moving its implementation from *.h to *.c.

Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-05 16:07:39 -06:00
Parav Pandit
fe33507ec3 RDMA/core: Check error status of rdma_find_ndev_for_src_ip_rcu
rdma_find_ndev_for_src_ip_rcu() returns either valid netdev pointer or
ERR_PTR().  Instead of checking for NULL, check for error.

Fixes: caf1e3ae9f ("RDMA/core Introduce and use rdma_find_ndev_for_src_ip_rcu")
Reported-by: syzbot+20c32fa6ff84a2d28c36@syzkaller.appspotmail.com
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 20:47:41 -06:00
Leon Romanovsky
38716732f1 RDMA/netlink: Simplify netlink listener existence check
All users of rdma_nl_chk_listeners() are interested to get boolean answer
if netlink socket has listeners, so update all places to boolean function.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:06:07 -06:00
Kamal Heib
d31131bba5 RDMA: Remove unused parameter from ib_modify_qp_is_ok()
The ll parameter is not used in ib_modify_qp_is_ok(), so remove it.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:05:46 -06:00
Jason Gunthorpe
e73798f20e RDMA/uverbs: Fix RCU annotation for radix slot deference
The uapi radix tree is a write-once data structure protected by kref.
Once we get to the ioctl() fop it is not possible for anything else
to be writing to it, so the access should use rcu_dereference_protected.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-03 16:01:40 -06:00
Parav Pandit
41ab1cb7d1 RDMA/cma: Introduce and use cma_ib_acquire_dev()
When RDMA CM connect request arrives for IB transport, it already contains
device, port, netdevice (optional).

Instead of traversing all the cma devices, use the cma device already
found by the cma_find_listener() for which a listener id is provided.

iWarp devices doesn't need to derive RoCE GIDs, therefore drop RoCE
specific checks from cma_acquire_dev() and rename it to
cma_iw_acquire_dev().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:13 -06:00
Parav Pandit
ff11c6cd52 RDMA/cma: Introduce and use cma_acquire_dev_by_src_ip()
Light weight version of cma_acquire_dev() just for binding with rdma
device based on source IP(v4/v6) address.

This simplifies cma_acquire_dev() to avoid listen_id specific checks and
also for subsequent simplification for IB vs iWarp.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:13 -06:00
Parav Pandit
78fb282b15 RDMA/cma: Allow accepting requests for multi port rdma device
When IP failover is used between multiple ports of a given rdma device,
allow accepting CM requests from either of the ports.  This is applicable
for IPv4 and IPv6 non link local addressing scheme.

IPv6 link local addresses are bound. IP failover requests for listen
cm_ids bound to specific netdev interfaces cannot be supported.
(Similar to traditional sockets).

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-30 19:21:13 -06:00
Parav Pandit
3994586f4d RDMA/core: Acquire and release mmap_sem on page range
Currently mmap_sem is read locked while pinning the memory.  In a
multi-threaded application of a process, holding mmap_sem lock creates
contention with other threads who might be either registering memory,
creating QPs or simply doing mmap() as such operations also require to
hold the mmap_sem write lock.

All such operation cannot make forward progress until one memory pin
operation is completed.  It becomes more worse if the memory is unpinned
and/or memory registration is large (in GB range).

Therefore, instead of holding mmap_sem for too long (for whole region
pinning), acquire and release the lock for every few pages.  For example
on x86 with 4K page size, acquire and release mmap_sem for every 2Mbytes
memory chunk.

This allows other competing threads to make progress who might wish to
hold mmap_sem for shorter duration.

When memory registration latency is measured using [1] for memory sizes
ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many
runs no difference is seen other than run-to-run variance.

In other targeted tests of users with large memory, desired improvements
are seen due to reduced contention of mmap_sem.

[1] https://github.com/paravmellanox/rtool

$ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A

It registers pinned memory from 4K to 48GB size with 500 iterations for
each memory size.

$ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4

4 competing threads pin memory, each of 12GB size with 500 iterations.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-27 12:40:20 -06:00
Alex Estrin
c8b53d0c5e IB/sa: simplify return code logic for ib_nl_send_msg()
rdma_nl_multicast() returns either negative error code
or zero if succeeded. Remove unnecessary ret code checks
and reassignments.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-26 16:35:48 -06:00
Jason Gunthorpe
896de0090a RDMA/core: Use dev_name instead of ibdev->name
These return the same thing but dev_name is a more conventional use of the
kernel API.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
2018-09-26 13:51:48 -06:00
Jason Gunthorpe
43c7c851b9 RDMA/core: Use dev_err/dbg/etc instead of pr_* + ibdev->name
Any messages related to a device should be printed with the dev_*
formatters. This provides greater consistency for the user.

The core does not set pr_fmt so this has no significant change.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
2018-09-26 13:51:48 -06:00
Jason Gunthorpe
e349f858d2 RDMA: Fully setup the device name in ib_register_device
The current code has two copies of the device name, ibdev->dev and
dev_name(&ibdev->dev), and they are setup at different times, which is
very confusing.

Set them both up at the same time and make dev_name() the lead name, which
is the proper use of the driver core APIs. To make it very clear that the
name is not valid until registration pass it in to the
ib_register_device() call rather than messing with ibdev->name directly.

Also the reorganization now checks that dev_name is unique even if it does
not contain a %.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
2018-09-26 13:51:36 -06:00
Doug Ledford
c6ce580716 RDMA/umem: Fix potential addition overflow
Given a large enough memory allocation, it is possible to wrap the
pinned_vm counter.  Check for addition overflow to prevent such
eventualities.

Fixes: 40ddacf2dd ("RDMA/umem: Don't hold mmap_sem for too long")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 15:19:06 -06:00
Doug Ledford
3312d1c6bd RDMA/umem: Minor optimizations
Noticed while reviewing commit d4b4dd1b97 ("RDMA/umem: Do not use
current->tgid to track the mm_struct") patch.  Why would we take a lock,
adjust a protected variable, drop the lock, and *then* check the input
into our protected variable adjustment?  Then we have to take the lock
again on our error unwind.  Let's just check the input early and skip
taking the locks needlessly if the input isn't valid.

It was also noticed that we set mm = current->mm, we then never modify
mm, but we still go back and reference current->mm a number of times
needlessly.  Be consistent in using the stored reference in mm.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 15:19:06 -06:00
Parav Pandit
5c5702e259 RDMA/core: Set right entry state before releasing reference
Currently add_modify_gid() for IB link layer has followong issue
in cache update path.

When GID update event occurs, core releases reference to the GID
table without updating its state and/or entry pointer.

CPU-0                              CPU-1
------                             -----
ib_cache_update()                    IPoIB ULP
   add_modify_gid()                   [..]
      put_gid_entry()
      refcnt = 0, but
      state = valid,
      entry is valid.
      (work item is not yet executed).
                                   ipoib_create_ah()
                                     rdma_create_ah()
                                        rdma_get_gid_attr() <--
                                   	Tries to acquire gid_attr
                                        which has refcnt = 0.
                                   	This is incorrect.

GID entry state and entry pointer is provides the accurate GID enty
state. Such fields must be updated with rwlock to protect against
readers and, such fields must be in sane state before refcount can drop
to zero. Otherwise above race condition can happen leading to
use-after-free situation.

Following backtrace has been observed when cache update for an IB port
is triggered while IPoIB ULP is creating an AH.

Therefore, when updating GID entry, first mark a valid entry as invalid
through state and set the barrier so that no callers can acquired
the GID entry, followed by release reference to it.

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 4 PID: 29106 at lib/refcount.c:153 refcount_inc_checked+0x30/0x50
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:refcount_inc_checked+0x30/0x50
RSP: 0018:ffff8802ad36f600 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000008 RDI: ffffffff86710100
RBP: ffff8802d6e60a30 R08: ffffed005d67bf8b R09: ffffed005d67bf8b
R10: 0000000000000001 R11: ffffed005d67bf8a R12: ffff88027620cee8
R13: ffff8802d6e60988 R14: ffff8802d6e60a78 R15: 0000000000000202
FS: 0000000000000000(0000) GS:ffff8802eb200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3ab35e5c88 CR3: 00000002ce84a000 CR4: 00000000000006e0
IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Call Trace:
rdma_get_gid_attr+0x220/0x310 [ib_core]
? lock_acquire+0x145/0x3a0
rdma_fill_sgid_attr+0x32c/0x470 [ib_core]
rdma_create_ah+0x89/0x160 [ib_core]
? rdma_fill_sgid_attr+0x470/0x470 [ib_core]
? ipoib_create_ah+0x52/0x260 [ib_ipoib]
ipoib_create_ah+0xf5/0x260 [ib_ipoib]
ipoib_mcast_join_complete+0xbbe/0x2540 [ib_ipoib]

Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 15:01:09 -06:00
Mark Bloch
a9360abd3d IB/uverbs: Free uapi on destroy
Make sure we free struct uverbs_api once we clean the radix tree. It was
allocated by uverbs_alloc_api().

Fixes: 9ed3e5f447 ("IB/uverbs: Build the specs into a radix tree at runtime")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-25 14:47:33 -06:00
Jason Gunthorpe
2a3ccfdbeb RDMA/uverbs: Get rid of ucontext->tgid
Nothing uses this now, just delete it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
56ac9dd917 RDMA/umem: Avoid synchronize_srcu in the ODP MR destruction path
synchronize_rcu is slow enough that it should be avoided on the syscall
path when user space is destroying MRs. After all the rework we can now
trivially do this by having call_srcu kfree the per_mm.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
be7a57b41a RDMA/umem: Handle a half-complete start/end sequence
mmu_notifier_unregister() can race between a invalidate_start/end and
cause the invalidate_end to be skipped. This causes an imbalance in the
locking, which lockdep complains about.

This is not actually a bug, as we immediately kfree the memory holding the
lock, but it simple enough to fix.

Mark when the notifier is being destroyed and abort the start callback.
This can be done under the lock we already obtained, and can re-purpose
the invalidate_range test we already have.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
ca748c39ea RDMA/umem: Get rid of per_mm->notifier_count
This is intrinsically racy and the scheme is simply unnecessary. New MR
registration can wait for any on going invalidation to fully complete.

      CPU0                              CPU1
                                  if (atomic_read())
 if (atomic_dec_and_test() &&
     !list_empty())
  { /* not taken */ }
                                       list_add()

Putting the new UMEM into some kind of purgatory until another invalidate
rolls through..

Instead hold the read side of the umem_rwsem across the pair'd start/end
and get rid of the racy 'deferred add' approach.

Since all umem's in the rbt are always ready to go, also get rid of the
mn_counters_active stuff.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
f27a0d50a4 RDMA/umem: Use umem->owning_mm inside ODP
Since ODP had a single struct mmu_notifier located in the ucontext it
could only handle a single MM at a time, and this prevented it from using
the new owning_mm system.

With the prior rework it is now simple to let ODP track multiple MMs per
ucontext, finish the job so that the per_mm is allocated on a mm by mm
basis, and freed when the last umem is dropped from the ucontext.

As a side effect the new saner locking removes the lockdep splat about
nesting the umem_rwsem between mmu_notifier_unregister and
ib_umem_odp_release.

It also makes ODP work with multiple processes, across, fork, etc.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:58:36 -04:00
Jason Gunthorpe
c9990ab39b RDMA/umem: Move all the ODP related stuff out of ucontext and into per_mm
This is the first step to make ODP use the owning_mm that is now part of
struct ib_umem.

Each ODP umem is linked to a single per_mm structure, which in turn, is
linked to a single mm, via the embedded mmu_notifier. This first patch
introduces the structure and reworks eveything to use it.

This also needs to introduce tgid into the ib_ucontext_per_mm, as
get_user_pages_remote() requires the originating task for statistics
tracking.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Jason Gunthorpe
597ecc5a09 RDMA/umem: Get rid of struct ib_umem.odp_data
This no longer has any use, we can use container_of to get to the
umem_odp, and a simple flag to indicate if this is an odp MR. Remove the
few remaining references to it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Jason Gunthorpe
41b4deeaa1 RDMA/umem: Make ib_umem_odp into a sub structure of ib_umem
These two structures are linked together, use the container_of pattern
instead of a double allocation to make the code simpler and easier to
follow.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Jason Gunthorpe
b5231b019d RDMA/umem: Use ib_umem_odp in all function signatures connected to ODP
All of these functions already require the ODP version of the umem struct,
make this very clear by having the signature require it. This paves the
way to using the container_of() pattern to link umem_odp and umem
together.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-21 11:54:46 -04:00
Majd Dibbiny
4eeed36869 RDMA/uverbs: Fix validity check for modify QP
Uverbs shouldn't enforce QP state in the command unless the user set the QP
state bit in the attribute mask.

In addition, only copy qp attr fields which have the corresponding bit set
in the attribute mask over to the internal attr structure.

Fixes: 88de869bbe ("RDMA/uverbs: Ensure validity of current QP state value")
Fixes: bc38a6abdd ("[PATCH] IB uverbs: core implementation")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-20 16:47:30 -06:00
Jason Gunthorpe
d4b4dd1b97 RDMA/umem: Do not use current->tgid to track the mm_struct
This is just wrong, the process that calls into the reg_mr is the process
associated with the umem, and that does not have to be the same process
that created the context.

When this code was first written mmgrab() didn't exist, however these days
we can just directly hold the mm_struct pointer in the umem and have no
ambiguity when it comes to releasing the umem as to which mm it was
associated with.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-20 16:19:30 -04:00
Jason Gunthorpe
ce92db1ca8 RDMA/ucontext: Get rid of the old disassociate flow
The disassociate_ucontext function in every driver is now empty, so we
don't need this ugly and wrong code that was messing with tgids.

rdma_user_mmap_io does this same work in a better way.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-20 16:19:30 -04:00
Jason Gunthorpe
5f9794dc94 RDMA/ucontext: Add a core API for mmaping driver IO memory
To support disassociation and PCI hot unplug, we have to track all the
VMAs that refer to the device IO memory. When disassociation occurs the
VMAs have to be revised to point to the zero page, not the IO memory, to
allow the physical HW to be unplugged.

The three drivers supporting this implemented three different versions
of this algorithm, all leaving something to be desired. This new common
implementation has a few differences from the driver versions:

- Track all VMAs, including splitting/truncating/etc. Tie the lifetime of
  the private data allocation to the lifetime of the vma. This avoids any
  tricks with setting vm_ops which Linus didn't like. (see link)
- Support multiple mms, and support properly tracking mmaps triggered by
  processes other than the one first opening the uverbs fd. This makes
  fork behavior of disassociation enabled drivers the same as fork support
  in normal drivers.
- Don't use crazy get_task stuff.
- Simplify the approach for to racing between vm_ops close and
  disassociation, fixing the related bugs most of the driver
  implementations had. Since we are in core code the tracking list can be
  placed in struct ib_uverbs_ufile, which has a lifetime strictly longer
  than any VMAs created by mmap on the uverbs FD.

Link: https://www.spinics.net/lists/stable/msg248747.html
Link: https://lkml.kernel.org/r/CA+55aFxJTV_g46AQPoPXen-UPiqR1HGMZictt7VpC-SMFbm3Cw@mail.gmail.com
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-20 16:19:30 -04:00
Jason Gunthorpe
0099103926 RDMA/uverbs: Fix error unwind in ib_uverbs_add_one
The error path has several mistakes

- cdev_del should not be called if cdev_device_add fails
- We must call put_device on all the goto exit paths as that is what frees
  the uapi, SRCU and the struct itself.

While we are here consolidate all the uvdev_dev init that cannot fail at
the top.

Fixes: c5c4d92e70 ("RDMA/uverbs: Use cdev_device_add() instead of cdev_add()")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
2018-09-19 10:49:22 -06:00
YueHaibing
0965cc953a RDMA/core: Properly return the error code of rdma_set_src_addr_rcu
rdma_set_src_addr_rcu should check copy_src_l2_addr fails, rather than
always return 0. Also copy_src_l2_addr should return 'ret' as its return
value when rdma_translate_ip fails.

Fixes: c31d4b2ddf ("RDMA/core: Protect against changing dst->dev during destination resolve")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-19 10:12:44 -06:00
Jason Gunthorpe
6ebce44746 RDMA/uverbs: Remove is_closed from ib_uverbs_file
This does nothing but indicate if the uverbs_file is in the device's list,
use list_del_init instead.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-09-19 10:07:05 -06:00
Cong Wang
5fe23f262e ucma: fix a use-after-free in ucma_resolve_ip()
There is a race condition between ucma_close() and ucma_resolve_ip():

CPU0				CPU1
ucma_resolve_ip():		ucma_close():

ctx = ucma_get_ctx(file, cmd.id);

        list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                mutex_lock(&mut);
                idr_remove(&ctx_idr, ctx->id);
                mutex_unlock(&mut);
		...
                mutex_lock(&mut);
                if (!ctx->closing) {
                        mutex_unlock(&mut);
                        rdma_destroy_id(ctx->cm_id);
		...
                ucma_free_ctx(ctx);

ret = rdma_resolve_addr();
ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx
and after rdma_destroy_id(), rdma_resolve_addr() may still
access id_priv pointer. Also, ucma_put_ctx() may use ctx after
ucma_free_ctx() too.

ucma_close() should call ucma_put_ctx() too which tests the
refcnt and waits for the last one releasing it. The similar
pattern is already used by ucma_destroy_id().

Reported-and-tested-by: syzbot+da2591e115d57a9cbb8b@syzkaller.appspotmail.com
Reported-by: syzbot+cfe3c1e8ef634ba8964b@syzkaller.appspotmail.com
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-09-13 13:04:13 -04:00
Parav Pandit
0e9d2c19bf RDMA/core: Consider net ns of gid attribute for RoCE
When resolving destination address or route, when net namespace is
unavailable, refer to the net namespace of the netdevice of the SGID
attribute. This is typically the case for requests arriving from the
network for RoCE ports.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:17 -06:00
Parav Pandit
d6b1764a8c RDMA/core: Introduce rdma_read_gid_attr_ndev_rcu() to check GID attribute
Introduce an API rdma_read_gid_attr_ndev_rcu() to return GID attribute
netdevice which is in UP state for accessing netdevice's fields such as
net namespace and ifindex.

This is useful for users who intent to access netdevice fields under rcu
lock.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:17 -06:00
Parav Pandit
6aaecd3856 RDMA/core: Simplify roce_resolve_route_from_path()
Currently RoCE route resolve functionality is split between two
functions. (a) roce_resolve_route_from_path() and its helper function
rdma_resolve_ip_route().

Due to this multiple sockaddr src structures are created in both functions
with rdma_dev_addr is an interface between the two for checks.

Since there is only one user of rdma_resolve_ip_route() as RoCE, combine
the functionality of both functions to roce_resolve_route_from_path() and
further reduce the scope of rdma_dev_addr to core/addr.c

This also allow to extend addr_resolve() in subsequent patch to consider
netdev properties of GID in safer way under rcu lock.

Additionally src and dst addresses were always provided, so skip the src
addr NULL pointer check as they are present on the stack now.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:17 -06:00
Parav Pandit
c31d4b2ddf RDMA/core: Protect against changing dst->dev during destination resolve
During resolving address process, during route lookup and while performing
src address translation in case of loopback mode, hold the rcu lock so
that if netdevice is moving to different net namespace, or being
unregistered, it can be synchronized with net/core/dev.c, ie

change_net_namespace()
->dev_close_many()
  ->rt6_uncached_list_flush_dev() who would change dst->dev

to loopback device of the given net namespace.

Therefore, hold the rcu lock and sync with synchronize_net() of
change_net_namespace() to ensure that netdevice cannot get freed while
dst->dev is being used.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 16:32:16 -06:00
Parav Pandit
307edde8ef RDMA/core: Refer to network type instead of device type
Set and refer to rdma_dev_addr network type instead of dst->ndev to reduce
dependency on accessing dst netdevice.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
783793b554 RDMA/core: Use common code flow for IPv4/6 for addr resolve
Use common code flow for resolving neighbour and for finding source
addresses.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
77addc5244 RDMA/core: Rename rdma_copy_addr to rdma_copy_src_l2_addr
Now that rdma_copy_addr() only copies the source addresses and all callers
are interested in copying only source addresses, simplify it to drop the
destination address argument.

Given that it only copies source layer2 addresses, rename it to
rdma_copy_src_l2_addr for better code readability.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
a362ea1d9e RDMA/core: Introduce and use rdma_set_src_addr() between IPv4 and IPv6
rdma_translate_ip() is done while resolving address for the loopback
addresses. The current flow is convoluted with resolve neighbor being
optional.

This patch simplifies the code in following ways.

(a) Use common code between IPv4 and IPv6 for address translation,
    loopback checks and acquiring netdevice.
(b) During neigh resolve in addr_resolve_neigh(), only copy destination
    address.
(c) Always resolve the source address before the destination address,
    because it doesn't depend on resolving neigh being requested or not.

This helps to reduce 3 calls of rdma_copy_addr and rdma_translate_ip to
one and makes it easier to follow the code flow.

Now that ib_nl_fetch_ha() doesn't depend on dst, drop dst argument from
ib_nl_fetch_ha().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
89c5691cdd RDMA/core: Let protocol specific function typecast sockaddr structure
Current code typecasts destination address using extra variable but uses
source address as is.

Even though the compiler optimizes such code well, just let each protocol
specific function typecast for src and dest both and have symmetric code.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
f89b7dfa33 RDMA/core: Avoid unnecessary sa_family overwrite
addr4_resolve() and addr6_resolve() are called by checking the value of
sa_family.

Both above functions overwrite the value after typecasting, this is not
necessary.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Parav Pandit
caf1e3ae9f RDMA/core Introduce and use rdma_find_ndev_for_src_ip_rcu
This fixes two issues:
1. When address family is other than IPv4 or v6, rdma_translate_ip()
   returns success which is incorrect.
2. When address familty is AF_INET6, and if the source address is not
   found, it returns success, which is also incorrect.

Therefore, introduce and use rdma_find_ndev_for_src_ip_rcu() helper
function which returns correct success or error status and is also useful
for future code refactor in addr_resolve().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:48:08 -06:00
Steve Wise
67e3816842 RDMA/uverbs: Atomically flush and mark closed the comp event queue
Currently a uverbs completion event queue is flushed of events in
ib_uverbs_comp_event_close() with the queue spinlock held and then
released.  Yet setting ev_queue->is_closed is not set until later in
uverbs_hot_unplug_completion_event_file().

In between the time ib_uverbs_comp_event_close() releases the lock and
uverbs_hot_unplug_completion_event_file() acquires the lock, a completion
event can arrive and be inserted into the event queue by
ib_uverbs_comp_handler().

This can cause a "double add" list_add warning or crash depending on the
kernel configuration, or a memory leak because the event is never dequeued
since the queue is already closed down.

So add setting ev_queue->is_closed = 1 to ib_uverbs_comp_event_close().

Cc: stable@vger.kernel.org
Fixes: 1e7710f3f6 ("IB/core: Change completion channel to use the reworked objects schema")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-12 15:43:15 -06:00
Mark Bloch
fa76d24ee0 RDMA/mlx5: Add flow actions support to raw create flow
Support attaching flow actions to a flow rule via raw create flow.
For now only NIC RX path is supported. This change requires to export
flow resources management functions so we can maintain proper bookkeeping
of flow actions.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-09-11 09:28:07 -06:00