- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
This code checks "index" for an upper bound but it does not check for
negatives. Change the type to unsigned to prevent underflows.
Fixes: 3c3c1f1416 ("RDMA/nldev: Allow optional-counter status configuration through RDMA netlink")
Link: https://lore.kernel.org/r/20220316083948.GC30941@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
During READ/WRITE preparation, in case of failure in memory registration
using iser_reg_mem_fastreg we must unmap previously mapped iser task.
Link: https://lore.kernel.org/r/20220308145546.8372-5-mgurtovoy@nvidia.com
Reviewed-by: Sergey Gorenko <sergeygo@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Avoid code duplication and add the mapping/unmapping of the protection
buffers to the iser_dma_map_task_data/iser_dma_unmap_task_data functions.
Link: https://lore.kernel.org/r/20220308145546.8372-4-mgurtovoy@nvidia.com
Reviewed-by: Sergey Gorenko <sergeygo@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
After removing the FMR support in iSER, there is only one type of
registration context. Replace the void pointer with the explicit structure
for registration (struct iser_fr_desc).
Link: https://lore.kernel.org/r/20220308145546.8372-3-mgurtovoy@nvidia.com
Reviewed-by: Sergey Gorenko <sergeygo@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Open coding it makes the code more readable and simple.
Link: https://lore.kernel.org/r/20220308145546.8372-2-mgurtovoy@nvidia.com
Reviewed-by: Sergey Gorenko <sergeygo@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Rename rxe_add_ref() to rxe_get() and rxe_drop_ref() to rxe_put().
Significantly improves readability for new readers.
Link: https://lore.kernel.org/r/20220304000808.225811-10-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently the rxe driver uses red-black trees to add indices to the rxe
object pools. Linux xarrays provide a better way to implement the same
functionality for indices. This patch replaces red-black trees by xarrays
for pool objects. Since xarrays already have a spinlock use that in place
of the pool rwlock. Make sure that all changes in the xarray(index) and
kref(ref counnt) occur atomically.
Link: https://lore.kernel.org/r/20220304000808.225811-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the maximum number of elements from a parameter in rxe_pool_init to a
member of the rxe_type_info array.
Link: https://lore.kernel.org/r/20220304000808.225811-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is only one remaining object type that allocates its own memory,
that is mr. So the sense of RXE_POOL_NO_ALLOC is changed to
RXE_POOL_ALLOC. Add checks to rxe_alloc() and rxe_add_to_pool() to make
sure the correct call is used for the setting of this flag.
Link: https://lore.kernel.org/r/20220304000808.225811-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently rxe saves a copy of MR in responder resources for RDMA reads.
Since the responder resources are never freed just over written if more
are needed this MR may not have a reference freed until the QP is
destroyed. This patch uses the rkey instead of the MR and on subsequent
packets of a multipacket read reply message it looks up the MR from the
rkey for each packet. This makes it possible for a user to deregister an
MR or unbind a MW on the fly and get correct behaviour.
Link: https://lore.kernel.org/r/20220304000808.225811-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The commit referenced below can take a reference to the AH which is never
dropped. This only happens in the UD request path. This patch optionally
passes that AH back to the caller so that it can hold the reference while
the AV is being accessed and then drop it. Code to do this is added to
rxe_req.c. The AV is also passed to rxe_prepare in rxe_net.c as an
optimization.
Fixes: e2fe06c908 ("RDMA/rxe: Lookup kernel AH from ah index in UD WQEs")
Link: https://lore.kernel.org/r/20220304000808.225811-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Before destroying MPT, the reserved loopback QPs send loopback IOs (one
write operation per SL). Completing these loopback IOs represents that
there isn't any outstanding request in MPT, then it's safe to destroy MPT.
Link: https://lore.kernel.org/r/20220310042835.38634-1-liangwenpeng@huawei.com
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Address handles (AH) are a limited HW resource and some user applications
may create large numbers of identical AH's. Avoid running out of AH's by
reusing existing identical ones.
Link: https://lore.kernel.org/r/20220228183650.290-1-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In case the second xa_insert() fails, the obj_event is not released. Fix
the error unwind flow to free that memory to avoid a memory leak.
Fixes: 7597385371 ("IB/mlx5: Enable subscription for device events over DEVX")
Link: https://lore.kernel.org/r/1647018361-18266-1-git-send-email-lyz_cs@pku.edu.cn
Signed-off-by: Yongzhi Liu <lyz_cs@pku.edu.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The argument 'payload' is not used in update_state(), so just remove it.
Link: https://lore.kernel.org/r/20220307145047.3235675-2-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The type of wqe length is u32 so in order to avoid overflow and shadow
casting change variable and relevant function argument to proper type.
Link: https://lore.kernel.org/r/20220307145047.3235675-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
My static checker complains that:
drivers/infiniband/hw/irdma/ctrl.c:3605 irdma_sc_ceq_init()
warn: can subtract underflow 'info->dev->hmc_fpm_misc.max_ceqs'?
It appears that "info->dev->hmc_fpm_misc.max_ceqs" comes from the firmware
in irdma_sc_parse_fpm_query_buf() so, yes, there is a chance that it could
be zero. Even if we trust the firmware, it's easy enough to change the
condition just as a hardenning measure.
Fixes: 3f49d68425 ("RDMA/irdma: Implement HW Admin Queue OPs")
Link: https://lore.kernel.org/r/20220307125928.GE16710@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the debugfs entry pointers under priv to their own struct.
Add get function for device debugfs root.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Abstract the alloc_cqc() into several parts and separate the process
unrelated to allocating CQC.
Link: https://lore.kernel.org/r/20220302064830.61706-10-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Abstract the alloc_srqc() into several parts and separate the alloc_srqn()
from the alloc_srqc().
Link: https://lore.kernel.org/r/20220302064830.61706-9-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
hns_roce_alloc_cmd_mailbox() never returns NULL, so the check should be
IS_ERR(). And the error code should be converted as the function's return
value.
Link: https://lore.kernel.org/r/20220302064830.61706-8-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The current mailbox functions have too many parameters, making the code
difficult to maintain. So construct a new structure mbox_msg to pass the
information needed by mailbox.
Link: https://lore.kernel.org/r/20220302064830.61706-6-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The "op" field of the mailbox occupies 8 bits, so the parameter "op"
should be of type u8.
Link: https://lore.kernel.org/r/20220302064830.61706-5-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The parameter "out_param" of the mailbox is always null when the context is
destroyed. So remove the function parameter "mailbox".
Link: https://lore.kernel.org/r/20220302064830.61706-4-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The value of the function parameter “timeout” is unique. Therefore,
it is unnecessary to specify the parameter “timeout” value each time.
So remove it.
Link: https://lore.kernel.org/r/20220302064830.61706-3-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The parameter "op_modifier" is only used for HIP06. It is useless for HIP08
and later versions. After removing HIP06, this parameter is no longer used,
so remove it.
Link: https://lore.kernel.org/r/20220302064830.61706-2-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_destroy_qp() would called by ib_create_qp_user() if error, the former
contains ib_qp_usecnt_dec(), but ib_qp_usecnt_inc() was not called before.
So move ib_qp_usecnt_inc() into create_qp().
Fixes: d2b10794fc ("RDMA/core: Create clean QP creations interface for uverbs")
Link: https://lore.kernel.org/r/20220303024232.2847388-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The AIP code signals the phys_mtu in the following query_port()
fragment:
props->phys_mtu = HFI1_CAP_IS_KSET(AIP) ? hfi1_max_mtu :
ib_mtu_enum_to_int(props->max_mtu);
Using the largest MTU possible should not depend on AIP.
Fix by unconditionally using the hfi1_max_mtu value.
Fixes: 6d72344cf6 ("IB/ipoib: Increase ipoib Datagram mode MTU's upper limit")
Link: https://lore.kernel.org/r/1644348309-174874-1-git-send-email-mike.marciniszyn@cornelisnetworks.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.
Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Saeed Mahameed says:
====================
mlx5-next 2022-22-02
The following PR includes updates to mlx5-next branch:
Headlines:
==========
1) Jakub cleans up unused static inline functions
2) I did some low level firmware command interface return status changes to
provide the caller with full visibility on the error/status returned by
the Firmware.
3) Use the new command interface in RDMA DEVX usecases to avoid flooding
dmesg with some "expected" user error prone use cases.
4) Moshe also uses the new command interface to grab the specific error
code from MFRL register command to provide the exact error reason for
why SW reset couldn't perform internally in FW.
5) From Mark Bloch: Lag, drop packets in hardware when possible
In active-backup mode the inactive interface's packets are dropped by the
bond device. In switchdev where TC rules are offloaded to the FDB
this can lead to packets being hit in the FDB where without offload
they would have been dropped before reaching TC rules in the kernel.
Create a drop rule to make sure packets on inactive ports are dropped
before reaching the FDB.
Listen on NETDEV_CHANGEUPPER / NETDEV_CHANGEINFODATA events and record
the inactive state and offload accordingly.
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add clarification on sync reset failure
net/mlx5: Add reset_state field to MFRL register
RDMA/mlx5: Use new command interface API
net/mlx5: cmdif, Refactor error handling and reporting of async commands
net/mlx5: Use mlx5_cmd_do() in core create_{cq,dct}
net/mlx5: cmdif, Add new api for command execution
net/mlx5: cmdif, cmd_check refactoring
net/mlx5: cmdif, Return value improvements
net/mlx5: Lag, offload active-backup drops to hardware
net/mlx5: Lag, record inactive state of bond device
net/mlx5: Lag, don't use magic numbers for ports
net/mlx5: Lag, use local variable already defined to access E-Switch
net/mlx5: E-switch, add drop rule support to ingress ACL
net/mlx5: E-switch, remove special uplink ingress ACL handling
net/mlx5: E-Switch, reserve and use same uplink metadata across ports
net/mlx5: Add ability to insert to specific flow group
mlx5: remove unused static inlines
====================
Link: https://lore.kernel.org/r/20220223233930.319301-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The rdma_zalloc_drv_obj() in __ib_alloc_pd() would zero pd, it unnecessary
add NULL to the object in struct pd.
The uverbs_free_pd() already return busy if pd->usecnt is true, there is
no need to add a warning.
Link: https://lore.kernel.org/r/20220223074901.201506-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The PD id is masked with 0x7fff, while PD can be 18 bits for GEN2 HW.
Remove the masking as it should not be needed and can cause incorrect PD
id to be used.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20220225163211.127-4-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Using PCI_FUNC macro in a VM, when the device is in passthrough mode does
not provide the real function instance. This means that currently, devices
will not probe unless the instance in the VM matches the instance in the
host.
Fix this by getting the pf_id from the LAN during the probe.
Fixes: 8498a30e1b ("RDMA/irdma: Register auxiliary driver and implement private channel OPs")
Link: https://lore.kernel.org/r/20220225163211.127-3-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, events on vlan netdevs are being ignored. Fix this by finding
the real netdev and processing the notifications for vlan netdevs.
Fixes: 915cc7ac0f ("RDMA/irdma: Add miscellaneous utility definitions")
Link: https://lore.kernel.org/r/20220225163211.127-2-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The function irdma_create_mg_ctx always returns 0, so make it void and
delete the return value check.
Link: https://lore.kernel.org/r/20220224182832.3896686-1-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If the state is not idle then resolve_prepare_src() should immediately
fail and no change to global state should happen. However, it
unconditionally overwrites the src_addr trying to build a temporary any
address.
For instance if the state is already RDMA_CM_LISTEN then this will corrupt
the src_addr and would cause the test in cma_cancel_operation():
if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)
Which would manifest as this trace from syzkaller:
BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204
CPU: 1 PID: 32204 Comm: syz-executor.1 Not tainted 5.12.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
__kasan_report mm/kasan/report.c:399 [inline]
kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
__list_add_valid+0x93/0xa0 lib/list_debug.c:26
__list_add include/linux/list.h:67 [inline]
list_add_tail include/linux/list.h:100 [inline]
cma_listen_on_all drivers/infiniband/core/cma.c:2557 [inline]
rdma_listen+0x787/0xe00 drivers/infiniband/core/cma.c:3751
ucma_listen+0x16a/0x210 drivers/infiniband/core/ucma.c:1102
ucma_write+0x259/0x350 drivers/infiniband/core/ucma.c:1732
vfs_write+0x28e/0xa30 fs/read_write.c:603
ksys_write+0x1ee/0x250 fs/read_write.c:658
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
This is indicating that an rdma_id_private was destroyed without doing
cma_cancel_listens().
Instead of trying to re-use the src_addr memory to indirectly create an
any address derived from the dst build one explicitly on the stack and
bind to that as any other normal flow would do. rdma_bind_addr() will copy
it over the src_addr once it knows the state is valid.
This is similar to commit bc0bdc5afa ("RDMA/cma: Do not change
route.addr.src_addr.ss_family")
Link: https://lore.kernel.org/r/0-v2-e975c8fd9ef2+11e-syz_cma_srcaddr_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: 732d41c545 ("RDMA/cma: Make the locking for automatic state transition more clear")
Reported-by: syzbot+c94a3675a626f6333d74@syzkaller.appspotmail.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Firstly the variable saddr was to check the type of a network. Now the
variable net_type is used to do the same work. So it is removed.
Link: https://lore.kernel.org/r/20220223024252.3873736-3-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Collect cleanup code for struct rxe_mca into a subroutine,
__rxe_cleanup_mca() called in rxe_detach_mcg() in rxe_mcast.c.
Link: https://lore.kernel.org/r/20220223230706.50332-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>