This is a preparation patch to provide isolation of rdma device in a
network namespace.
As first step, make rdma device visible only in init net namespace.
Subsequent patch will enable rdma device visibility back in multiple net
namespaces using compat ib_core_device device/sysfs tree.
Given that the IB subsystem depends on net stack, it needs to be
initialized after netdev and since it support devices, it needs to be
initialized before the device subsystem; therefore, change initcall
sequence to fs_initcall, so that when ib_core is compiled in the kernel
image, the right init sequence is followed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In order to support sysfs entries in multiple net namespaces for a rdma
device, introduce a ib_core_device whose scope is limited to hold core
device and per port sysfs related entries.
This is preparation patch so that multiple ib_core_devices in each net
namespace can be created in subsequent patch who all can share ib_device.
(a) Move sysfs specific fields to ib_core_device.
(b) Make sysfs and device life cycle related routines to work on
ib_core_device.
(c) Introduce and use rdma_init_coredev() helper to initialize
coredev fields.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch avoids that sparse reports the following warnings:
drivers/infiniband/core/uverbs_std_types_flow_action.c:442:30: warning: symbol 'uverbs_def_obj_flow_action' was not declared. Should it be static?
drivers/infiniband/core/uverbs_std_types_dm.c:112:30: warning: symbol 'uverbs_def_obj_dm' was not declared. Should it be static?
drivers/infiniband/core/uverbs_std_types_counters.c:153:30: warning: symbol 'uverbs_def_obj_counters' was not declared. Should it be static?
drivers/infiniband/core/uverbs_std_types_mr.c:213:30: warning: symbol 'uverbs_def_obj_mr' was not declared. Should it be static?
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 0bd01f3d09 ("RDMA/uverbs: Require all objects to have a driver destroy function")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch avoids that sparse complains about a mismatch between the
returned value and the function return type.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: c3bea3d2dc ("RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch avoids that sparse and smatch report the following:
warning: cast removes address space of expression
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Fixes: 3a6532c9af ("RDMA/uverbs: Use uverbs_attr_bundle to pass udata for write")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Decode more information from the packet and include it in the trace.
Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Trace MADs going to/from user space.
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Trace agent details when agents are [un]registered. In addition, report
agent details on send/recv.
Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Trace received MAD details.
Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use the standard Linux trace mechanism to trace MADs being sent. 4 trace
points are added, when the MAD is posted to the qp, when the MAD is
completed, if a MAD is resent, and when the MAD completes in error.
Reviewed-by: "Ruhl, Michael J" <michael.j.ruhl@intel.com>
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No device supports ODP MR without an invalidate_range callback.
Warn on any any device which attempts to support ODP without supplying
this callback.
Then we can remove the checks for the callback within the code.
This stems from the discussion
https://www.spinics.net/lists/linux-rdma/msg76460.html
...which concluded this code was no longer necessary.
Acked-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Also introduce cm_local_id() to reduce the amount of boilerplate when
converting a local ID to an XArray index.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pull the allocation function out into its own function to reduce the
length of ib_register_mad_agent() a little and keep all the allocation
logic together.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
"__attribute__" set of macros has been standardized, have became more
potentially portable and consistent code back in v2.6.21 by commit
82ddcb040 ("[PATCH] extend the set of "__attribute__" shortcut macros").
Moreover, nowadays checkpatch.pl warns about using __attribute__((packed))
instead of __packed.
This patch converts all the "__attribute__ ((packed))" annotations to
"__packed" within the RDMA subsystem.
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlyHF2oUHHdpbGx5QGlu
ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5j9AgAlpeptRfnPO0+VXj+EbxaOOI8tOG+
w+vBasWoQB+lZ9ctf1qUQVSeLn0ErxTM7BaIP7plfDrEWiIbRWkV18B+heS5d1Yz
aTV1d/8tG6/eo61K2VqXHbUhymgMtbXDsg1rwWTF8+Q4xIcMqfYAR0f9ptU1Oejc
pNAn16dYgKi6+4eluY7gXxruBosQ6yNml6iEje9A3uR8nhzTI/P3Yf2GGIZnQLsL
+UIx4Ps38dJ3VCYBPfbnszZfYPpILUH9/Bdx+mAMUtZwvpM3JYqc8XsiFfqDO7n1
3003yUytnRkb1UK3QIvkbPt0G8UOI4s9fxRPsA8lLSww/f2y1r5kC4Mxbg==
=HSP/
-----END PGP SIGNATURE-----
Merge tag 'xarray-5.1-rc1' of git://git.infradead.org/users/willy/linux-dax
Pull XArray updates from Matthew Wilcox:
"This pull request changes the xa_alloc() API. I'm only aware of one
subsystem that has started trying to use it, and we agree on the fixup
as part of the merge.
The xa_insert() error code also changed to match xa_alloc() (EEXIST to
EBUSY), and I added xa_alloc_cyclic(). Beyond that, the usual
bugfixes, optimisations and tweaking.
I now have a git tree with all users of the radix tree and IDR
converted over to the XArray that I'll be feeding to maintainers over
the next few weeks"
* tag 'xarray-5.1-rc1' of git://git.infradead.org/users/willy/linux-dax:
XArray: Fix xa_reserve for 2-byte aligned entries
XArray: Fix xa_erase of 2-byte aligned entries
XArray: Use xa_cmpxchg to implement xa_reserve
XArray: Fix xa_release in allocating arrays
XArray: Mark xa_insert and xa_reserve as must_check
XArray: Add cyclic allocation
XArray: Redesign xa_alloc API
XArray: Add support for 1s-based allocation
XArray: Change xa_insert to return -EBUSY
XArray: Update xa_erase family descriptions
XArray tests: RCU lock prohibits GFP_KERNEL
The previous attempted bug fix overlooked the fact that
ib_umem_odp_map_dma_single_page() was doing a put_page() upon hitting an
error. So there was not really a bug there.
Therefore, this reverts the off-by-one change, but keeps the change to use
release_pages() in the error path.
Fixes: 75a3e6a3c1 ("RDMA/umem: minor bug fix in error handling path")
Suggested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
1. Bug fix: fix an off by one error in the code that cleans up if it fails
to dma-map a page, after having done a get_user_pages_remote() on a
range of pages.
2. Refinement: for that same cleanup code, release_pages() is better than
put_page() in a loop.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to call kfree(pd) because ib_dealloc_pd() internally
frees PD.
Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Following the PD conversion patch, do the same for ucontext allocations.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The first parameter of WARN_ONCE() is a condition, then following
parameters are the message. In this case, we left out the condition so it
will just print the ops->type string.
Fixes: 3856ec4b93 ("RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It is possible that during a page fault handling, the process that owns
the MR is terminating. The indication for it is failure to get the
task_struct or take reference on the mm_struct. In this case just abort
the page-fault handler with error but without a warning to the kernel log.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Before calling the provider's alloc_mw function, verify that the
given memory type is either IB_MW_TYPE_1 or IB_MW_TYPE_2.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The strlen() check at the beginning of iw_cm_map() ensures that devname
and ifname strings are less than destinations to which they are supposed
to be copied. Change strncpy() call to be strcpy(), because we are
protected from overflow. Zero the entire string buffer to avoid copying
uninitialized kernel stack memory to userspace.
This fixes the compilation warning below:
In file included from ./include/linux/dma-mapping.h:6,
from drivers/infiniband/core/iwcm.c:38:
In function _strncpy_,
inlined from _iw_cm_map_ at drivers/infiniband/core/iwcm.c:519:2:
./include/linux/string.h:253:9: warning: ___builtin_strncpy_ specified
bound 32 equals destination size [-Wstringop-truncation]
return __builtin_strncpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: d53ec8af56 ("RDMA/iwcm: Don't copy past the end of dev_name() string")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
old_pd is used only if IB_MR_REREG_PD flags is set.
For readability move it's initialization to where it is used.
While there rewrite the whole 'if-else' block so on error jump directly
to label and no need for 'else'
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for new LINK messages to allow adding and deleting rdma
interfaces. This will be used initially for soft rdma drivers which
instantiate device instances dynamically by the admin specifying a netdev
device to use. The rdma_rxe module will be the first user of these
messages.
The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
with the rdma core if they provide link add/delete functions. Each driver
registers with a unique "type" string, that is used to dispatch messages
coming from user space. A new RDMA_NLDEV_ATTR is defined for the "type"
string. User mode will pass 3 attributes in a NEWLINK message:
RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
link. The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
the device to delete.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since rxe allows unregistration from other threads the rxe pointer can
become invalid any moment after ib_register_driver returns. This could
cause a user triggered use after free.
Add another driver callback to be called right after the device becomes
registered to complete any device setup required post-registration. This
callback has enough core locking to prevent the device from becoming
unregistered.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rxe has an open coded version of this that is not as safe as the core
version. This lets us eliminate the internal device list entirely from
rxe.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These APIs are intended to support drivers that exist outside the usual
driver core probe()/remove() callbacks. Normally the driver core will
prevent remove() from running concurrently with probe(), once this safety
is lost drivers need more support to get the locking and lifetimes right.
ib_unregister_driver() is intended to be used during module_exit of a
driver using these APIs. It unregisters all the associated ib_devices.
ib_unregister_device_and_put() is to be used by a driver-specific removal
function (ie removal by name, removal from a netdev notifier, removal from
netlink)
ib_unregister_queued() is to be used from netdev notifier chains where
RTNL is held.
The locking is tricky here since once things become async it is possible
to race unregister with registration. This is largely solved by relying on
the registration refcount, unregistration will only ever work on something
that has a positive registration refcount - and then an unregistration
mutex serializes all competing unregistrations of the same device.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Several drivers need to find the ib_device from a given netdev. rxe needs
this at speed in an unsleepable context, so choose to implement the
translation using a RCU safe hash table.
The hash table can have a many to one mapping. This is intended to support
some future case where multiple IB drivers (ie iWarp and RoCE) connect to
the same netdevs. driver_ids will need to be different to support this.
In the process this makes the struct ib_device and ib_port_data RCU safe
by deferring their kfrees.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The associated netdev should not actually be very dynamic, so for most
drivers there is no reason for a callback like this. Provide an API to
inform the core code about the net dev affiliation and use a core
maintained data structure instead.
This allows the core code to be more aware of the ndev relationship which
will allow some new APIs based around this.
This also uses locking that makes some kind of sense, many drivers had a
confusing RCU lock, or missing locking which isn't right.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Like the other cases there no real reason to have another array just for
the cache. This larger conversion gets its own patch.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no reason to have three allocations of per-port data. Combine
them together and make the lifetime for all the per-port data match the
struct ib_device.
Following patches will require more port-specific data, now there is a
good place to put it.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We have many loops iterating over all of the end port numbers on a struct
ib_device, simplify them with a for_each helper.
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Netlink dumpit handshake exchanges the index from which kernel should
start to return its value, in current code, this index included
not-visible in this PID items too and indirectly revealed the number of
entries.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds ability to query specific QP based on its LQPN (local
QPN), which is assigned by HW and needs special treatment while inserting
into restrack DB.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
PD, MR and QP objects have parents objects: contexts and PDs. The exposed
parent IDs allow to correlate various objects and simplify debug
investigation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Give to the user space tools unique identifier for PD, MR, CQ and CM_ID
objects, so they can be able to query on them with .doit callbacks.
QP .doit is not supported yet, till all drivers will be updated to provide
their LQPN to be equal to their restrack ID.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As a preparation to extension of rdma_restrack_root to provide software
IDs, which will be per-type too. We convert the rdma_restrack_root from
struct with arrays to array of structs.
Such conversion allows us to drop rwsem lock in favour of internal XArray
lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to expose internals of restrack DB to IB/core.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
XArray uses internal lock for updates to XArray. This means that our
external RW lock is needed to ensure that entry is not deleted while we
are performing iteration over list.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Implement doit callbacks and ensure that users won't provide port values
on resource entry allocated in per-device mode needed for .doit callback.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add new general helper to get restrack entry given by ID and their
respective type.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The additions of .doit callbacks posses new access pattern to the resource
entries by some user visible index. Back then, the legacy DB was
implemented as hash because per-index access wasn't needed and XArray
wasn't accepted yet.
Acceptance of XArray together with per-index access requires the refresh
of DB implementation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move core device addition and removal from sysfs.c to device.c as device.c
is more appropriate place for device management.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Refactor code for device and port sysfs attributes for reuse.
While at it, rename counter part free function to ib_free_port_attrs.
Also attribute setup sequence is:
(a) port specific init.
(b) device stats alloc/init.
So for cleanup, follow reverse sequence:
(a) device stats dealloc
(b) port specific cleanup
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of holding extra reference using get_device() that
device_unregister() releases, simplify it as below.
device_add() balances with device_del(). device_initialize() balances
with put_device(), always via ib_dealloc_device().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>