1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Linux kernel source tree
Find a file
Yishai Hadas d97505baea RDMA/mlx5: Fix the recovery flow of the UMR QP
This patch addresses an issue in the recovery flow of the UMR QP,
ensuring tasks do not get stuck, as highlighted by the call trace [1].

During recovery, before transitioning the QP to the RESET state, the
software must wait for all outstanding WRs to complete.

Failing to do so can cause the firmware to skip sending some flushed
CQEs with errors and simply discard them upon the RESET, as per the IB
specification.

This race condition can result in lost CQEs and tasks becoming stuck.

To resolve this, the patch sends a final WR which serves only as a
barrier before moving the QP state to RESET.

Once a CQE is received for that final WR, it guarantees that no
outstanding WRs remain, making it safe to transition the QP to RESET and
subsequently back to RTS, restoring proper functionality.

Note:
For the barrier WR, we simply reuse the failed and ready WR.
Since the QP is in an error state, it will only receive
IB_WC_WR_FLUSH_ERR. However, as it serves only as a barrier we don't
care about its status.

[1]
INFO: task rdma_resource_l:1922 blocked for more than 120 seconds.
Tainted: G        W          6.12.0-rc7+ #1626
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rdma_resource_l state:D stack:0  pid:1922 tgid:1922  ppid:1369
     flags:0x00004004
Call Trace:
<TASK>
__schedule+0x420/0xd30
schedule+0x47/0x130
schedule_timeout+0x280/0x300
? mark_held_locks+0x48/0x80
? lockdep_hardirqs_on_prepare+0xe5/0x1a0
wait_for_completion+0x75/0x130
mlx5r_umr_post_send_wait+0x3c2/0x5b0 [mlx5_ib]
? __pfx_mlx5r_umr_done+0x10/0x10 [mlx5_ib]
mlx5r_umr_revoke_mr+0x93/0xc0 [mlx5_ib]
__mlx5_ib_dereg_mr+0x299/0x520 [mlx5_ib]
? _raw_spin_unlock_irq+0x24/0x40
? wait_for_completion+0xfe/0x130
? rdma_restrack_put+0x63/0xe0 [ib_core]
ib_dereg_mr_user+0x5f/0x120 [ib_core]
? lock_release+0xc6/0x280
destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs]
uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs]
uobj_destroy+0x3f/0x70 [ib_uverbs]
ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs]
? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs]
? __lock_acquire+0x64e/0x2080
? mark_held_locks+0x48/0x80
? find_held_lock+0x2d/0xa0
? lock_acquire+0xc1/0x2f0
? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs]
? __fget_files+0xc3/0x1b0
ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs]
? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs]
__x64_sys_ioctl+0x1b0/0xa70
do_syscall_64+0x6b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f99c918b17b
RSP: 002b:00007ffc766d0468 EFLAGS: 00000246 ORIG_RAX:
     0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc766d0578 RCX:
     00007f99c918b17b
RDX: 00007ffc766d0560 RSI: 00000000c0181b01 RDI:
     0000000000000003
RBP: 00007ffc766d0540 R08: 00007f99c8f99010 R09:
     000000000000bd7e
R10: 00007f99c94c1c70 R11: 0000000000000246 R12:
     00007ffc766d0530
R13: 000000000000001c R14: 0000000040246a80 R15:
     0000000000000000
</TASK>

Fixes: 158e71bb69 ("RDMA/mlx5: Add a umr recovery flow")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/27b51b92ec42dfb09d8096fcbd51878f397ce6ec.1737290141.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-03 06:35:27 -05:00
arch sh updates for v6.14 2025-02-02 10:40:27 -08:00
block block-6.14-20250131 2025-01-31 11:49:30 -08:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2024-09-20 19:52:48 +03:00
crypto treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Documentation 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
drivers RDMA/mlx5: Fix the recovery flow of the UMR QP 2025-02-03 06:35:27 -05:00
fs assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
include assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
init Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
io_uring io_uring-6.14-20250131 2025-01-31 11:29:23 -08:00
ipc treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
kernel 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
lib 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
LICENSES LICENSES: add 0BSD license text 2024-09-01 20:43:24 -07:00
mm assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
net assorted stuff for this merge window 2025-02-01 15:07:56 -08:00
rust Kbuild updates for v6.14 2025-01-31 12:07:07 -08:00
samples AT_EXECVE_CHECK update for v6.14-rc1 (fix1) 2025-01-31 17:12:31 -08:00
scripts 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
security treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
sound sound fixes for 6.14-rc1 2025-01-31 09:17:02 -08:00
tools Turbostat 2025.02.02 updates since 2024.11.30 2025-02-02 10:49:13 -08:00
usr kbuild: Drop support for include/asm-<arch> in headers_check.pl 2024-12-21 11:43:17 +09:00
virt Merge branch 'kvm-mirror-page-tables' into HEAD 2025-01-20 07:15:58 -05:00
.clang-format clang-format: Update with v6.11-rc1's for_each macro list 2024-08-02 13:20:31 +02:00
.clippy.toml rust: give Clippy the minimum supported Rust version 2025-01-10 00:17:25 +01:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.editorconfig .editorconfig: remove trim_trailing_whitespace option 2024-06-13 16:47:52 +02:00
.get_maintainer.ignore MAINTAINERS: Retire Ralf Baechle 2024-11-12 15:48:59 +01:00
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore rust: use host dylib naming convention to support macOS 2025-01-10 01:01:24 +01:00
.mailmap 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS Mainly individually changelogged singleton patches. The patch series in 2025-01-26 17:50:53 -08:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS 21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 2025-02-01 09:49:20 -08:00
Makefile Linux 6.14-rc1 2025-02-02 15:39:26 -08:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.