Commit graph

44 commits

Author SHA1 Message Date
Yishai Hadas
821b8f6bf8 vfio/mlx5: Enforce PRE_COPY support
Enable live migration only once the firmware supports PRE_COPY.

PRE_COPY has been supported by the firmware for a long time already [1]
and is required to achieve a low downtime during live migration.

This lets us clean up some old code that no longer applies now that
PRE_COPY is fully supported by the firmware.

[1] The minimum firmware version that supports PRE_COPY is 28.36.1010,
it was released in January 2023.

No firmware without PRE_COPY support was ever made available to users.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20240306105624.114830-1-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11 12:02:59 -06:00
Yishai Hadas
6de042240b vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNING
Let the firmware know when the device leaves PRE_COPY and goes back to
RUNNING, as happens upon an error on the target or a migration
cancellation.

This lets the firmware clean up the internal resources it turned on for
PRE_COPY.

The flow is based on the device specification in this area.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240205124828.232701-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22 12:17:32 -07:00
Yishai Hadas
d8d577b5fa vfio/mlx5: Block incremental query upon migf state error
Block the incremental query, which is state-dependent, once the
migration file has been marked as being in an error state.

This may prevent redundant firmware calls during PRE_COPY that would end
up failing and printing a syndrome in dmesg.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240205124828.232701-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22 12:17:32 -07:00
Yishai Hadas
793d4bfa31 vfio/mlx5: Handle the EREMOTEIO error upon the SAVE command
The SAVE command uses the async command interface over the PF.

Upon a failure in the firmware, -EREMOTEIO is returned.

In that case, call mlx5_cmd_out_err() to let it print the command
failure details, including the firmware syndrome (see the sketch after
this entry).

Note:
The other commands in the driver use the sync command interface, where a
firmware syndrome is already printed by mlx5_core upon an error.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240205124828.232701-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22 12:17:32 -07:00
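
A minimal sketch of the error reporting described above, assuming a
small helper invoked from the async SAVE completion path;
mlx5_cmd_out_err() and MLX5_CMD_OP_SAVE_VHCA_STATE are existing mlx5
identifiers, while the helper name and parameters are illustrative:

  /* Hedged sketch: dump the firmware failure details for an async SAVE. */
  static void report_save_fw_error(struct mlx5_core_dev *mdev, int status,
                                   void *out)
  {
          /*
           * The async command interface returns -EREMOTEIO when the firmware
           * itself failed the command; only then is the output buffer worth
           * dumping (opcode, op_mod and syndrome).
           */
          if (status == -EREMOTEIO)
                  mlx5_cmd_out_err(mdev, MLX5_CMD_OP_SAVE_VHCA_STATE, 0, out);
  }
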
Yishai Hadas
f886473071 vfio/mlx5: Add support for tracker object change event
Add support for the tracker object change event by handling its
MLX5_EVENT_TYPE_OBJECT_CHANGE event when it occurs (see the sketch after
this entry).

This lets the driver recognize whether the firmware moved the tracker
object to an error state.

In that case, the driver will skip/block any usage of that object,
including an early exit in case the object was previously marked with an
error.

This functionality also covers the case where no CQE is delivered
because of the error state.

The driver was adapted to the device specification to handle the above.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240205124828.232701-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22 12:17:32 -07:00
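
A hedged sketch of the event handling described above: a notifier that
marks a hypothetical tracker as being in error when the firmware reports
an object change for it. MLX5_EVENT_TYPE_OBJECT_CHANGE is the real event
type; the obj_change EQE field and the tracker structure are assumptions
for illustration only:

  struct tracker_state {
          struct notifier_block nb;
          u32 obj_id;
          bool is_err;
  };

  static int tracker_event_notifier(struct notifier_block *nb,
                                    unsigned long event, void *data)
  {
          struct tracker_state *tracker =
                  container_of(nb, struct tracker_state, nb);
          struct mlx5_eqe *eqe = data;

          if (event != MLX5_EVENT_TYPE_OBJECT_CHANGE)
                  return NOTIFY_DONE;

          /* Only react if the changed object is our page-tracker object. */
          if (be32_to_cpu(eqe->data.obj_change.obj_id) == tracker->obj_id)
                  tracker->is_err = true; /* later report/stop calls exit early */

          return NOTIFY_OK;
  }
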
Yishai Hadas
fcb2f2ed4a vfio/mlx5: Activate the chunk mode functionality
Now that all pieces are in place, activate the chunk mode functionality
based on device capabilities.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
67135f2945 vfio/mlx5: Add support for SAVING in chunk mode
Add support for SAVING in chunk mode; this includes running a work item
that fills the next chunk from the device.

In case the number of available chunks reaches MAX_NUM_CHUNKS, the next
chunk SAVING is delayed until the reader consumes one chunk.

The next patch from the series will add the reader part of the chunk
mode.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
5798e4dd58 vfio/mlx5: Pre-allocate chunks for the STOP_COPY phase
This patch is another preparation step towards working in chunk mode.

It pre-allocates chunks for the STOP_COPY phase to let the driver use
them immediately and avoid an extra allocation during that phase.

Before this patch we had a single large buffer dedicated to the
STOP_COPY phase, as there was a single SAVE in the source for the last
image.

Once we move to chunk mode, the idea is to have several small buffers
that will be used during the STOP_COPY phase.

The driver will read ahead the full state from the firmware in
small/optimized chunks while letting QEMU/user space read the available
data in parallel.

Each buffer holds its chunk number to let it be recognized down the road
in the coming patches.

The chunk buffer size is picked based on the minimum size that the
firmware requires, the total full size, and a max value in the driver
code, which was set to 8MB to achieve an optimized downtime in the
general case (see the sketch after this entry).

As chunk mode is applicable even if we move directly to STOP_COPY, the
buffer preparation and some other related setup is done unconditionally
with regard to STOP/PRE-COPY.

Note:
At this phase in the series we still haven't activated chunk mode, and
the first buffer is used everywhere.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
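
A hedged sketch of how such a chunk size could be derived; the 8MB cap
comes from the commit text, while the helper name and the number of
chunks are illustrative assumptions:

  #define MAX_CHUNK_SIZE  SZ_8M   /* driver cap mentioned above */
  #define NUM_CHUNKS      2       /* illustrative, not the driver's value */

  static size_t pick_chunk_size(size_t fw_min_size, size_t total_size)
  {
          size_t chunk = DIV_ROUND_UP(total_size, NUM_CHUNKS);

          /* Never go below what the firmware requires per SAVE ... */
          chunk = max(chunk, fw_min_size);
          /* ... and cap it so chunks stay cheap to allocate and transfer. */
          return min_t(size_t, chunk, MAX_CHUNK_SIZE);
  }
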
Yishai Hadas
9114100d10 vfio/mlx5: Rename some stuff to match chunk mode
In chunk mode there may be multiple images that will be read from the
device upon STOP_COPY.

This patch prepares for that mode by renaming the relevant fields and
helpers to better matching names.

As part of that, be stricter and recognize a PRE_COPY error only when it
didn't occur on a STOP_COPY chunk.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
543640af84 vfio/mlx5: Enable querying state size which is > 4GB
Once the device supports 'chunk mode', the driver can support a state
size larger than 4GB.

In that case the device has the capability to split a single image into
multiple chunks, as long as the software provides a buffer of the
minimum size reported by the device.

The driver should query the minimum required buffer size using the
QUERY_VHCA_MIGRATION_STATE command with the 'chunk' bit set in its
input; in that case the output will include both the minimum buffer size
(i.e. required_umem_size) and the remaining total size to be
reported/used where applicable (see the sketch after this entry).

At this point in the series the 'chunk' bit is off; the last patch will
activate the feature once all pieces are ready.

Note:
Before this change we were limited to a 4GB state size, due to the
4-byte max value defined by the device specification for the
query/save/load commands.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
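
A hedged sketch of the query described above; the 'chunk' bit and
required_umem_size are the names given in the commit text, while the
helper, its parameters, and the error handling are illustrative:

  static int query_min_save_size(struct mlx5_core_dev *mdev, u16 vhca_id,
                                 bool chunk_mode, u64 *buf_size)
  {
          u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
          u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
          int err;

          MLX5_SET(query_vhca_migration_state_in, in, opcode,
                   MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE);
          MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id);
          MLX5_SET(query_vhca_migration_state_in, in, chunk, chunk_mode);

          err = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out);
          if (err)
                  return err;

          /* With 'chunk' set, this is the minimum buffer size per SAVE. */
          *buf_size = MLX5_GET64(query_vhca_migration_state_out, out,
                                 required_umem_size);
          return 0;
  }
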
Yishai Hadas
34a64c8eac vfio/mlx5: Refactor the SAVE callback to activate a work only upon an error
Upon a successful SAVE callback there is no need to schedule a work
item; everything required can be done directly.

Therefore, refactor the above flow to schedule a work item only upon an
error.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Linus Torvalds
ec0e2dc810 VFIO updates for v6.6-rc1
- VFIO direct character device (cdev) interface support.  This extracts
    the vfio device fd from the container and group model, and is intended
    to be the native uAPI for use with IOMMUFD. (Yi Liu)
 
  - Enhancements to the PCI hot reset interface in support of cdev usage.
    (Yi Liu)
 
  - Fix a potential race between registering and unregistering vfio files
    in the kvm-vfio interface and extend use of a lock to avoid extra
    drop and acquires. (Dmitry Torokhov)
 
  - A new vfio-pci variant driver for the AMD/Pensando Distributed Services
    Card (PDS) Ethernet device, supporting live migration. (Brett Creeley)
 
  - Cleanups to remove redundant owner setup in cdx and fsl bus drivers,
    and simplify driver init/exit in fsl code. (Li Zetao)
 
  - Fix uninitialized hole in data structure and pad capability structures
    for alignment. (Stefan Hajnoczi)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmTvnDUbHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsimEEP/AzG+VRcu5LfYbLGLe0z
 zB8ts6G7S78wXlmfN/LYi3v92XWvMMcm+vYF8oNAMfr1YL5sibWN6UtQfY1KCr7h
 nWKdQdqjajJ5yDDZnOFdhqHJGNfmZw6+fey8Z0j8zRI2oymK4DncWWX3g/7L1SNr
 9tIexGJef+mOdAmC94yOut3YviAaZ+f95T/xrdXHzzoNr50DD0+PD6AJdKJfKggP
 vhiC/DAYH3Fofaa6tRasgWuKCYWdjZLR/kxgNpeEmW6kZnbq/dnzZ+kgn4HH1f9G
 8p7UKVARR6FfG5aLheWu6Y9PDaKnfnqu8y/hobuE/ivXcmqqK+a6xSxrjgbVs8WJ
 94SYnTBRoTlDJaKWa7GxqdgzJnV+s5ZyAgPhjzdi6mLTPWGzkuLhFWGtYL+LZAQ6
 pNeZSM6CFBk+bva/xT0nNPCXxPh+/j/Y0G18FREj8aPFc03HrJQqz0RLydvTnoDz
 nX/by5KdzMSVSVLPr4uDMtAsgxsGqWiFcp7QMw1HhhlLWxqmYbA+mLZaqyMZUUOx
 6b/P8WXT9P2I+qPVKWQ5CWyqpsEqm6P+72yg6LOM9kINvgwDhOa7cagMXIuMWYMH
 Rf97FL+K8p1eIy6AnvRHgFBMM5185uG+0YcJyVqtucDr/k8T/Om6ujAI6JbWtNe6
 cLgaVAqKOYqCR4HC9bfVGSbd
 =eKSR
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - VFIO direct character device (cdev) interface support. This extracts
   the vfio device fd from the container and group model, and is
   intended to be the native uAPI for use with IOMMUFD (Yi Liu)

 - Enhancements to the PCI hot reset interface in support of cdev usage
   (Yi Liu)

 - Fix a potential race between registering and unregistering vfio files
   in the kvm-vfio interface and extend use of a lock to avoid extra
   drop and acquires (Dmitry Torokhov)

 - A new vfio-pci variant driver for the AMD/Pensando Distributed
   Services Card (PDS) Ethernet device, supporting live migration (Brett
   Creeley)

 - Cleanups to remove redundant owner setup in cdx and fsl bus drivers,
   and simplify driver init/exit in fsl code (Li Zetao)

 - Fix uninitialized hole in data structure and pad capability
   structures for alignment (Stefan Hajnoczi)

* tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits)
  vfio/pds: Send type for SUSPEND_STATUS command
  vfio/pds: fix return value in pds_vfio_get_lm_file()
  pds_core: Fix function header descriptions
  vfio: align capability structures
  vfio/type1: fix cap_migration information leak
  vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code
  vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver
  vfio/pds: Add Kconfig and documentation
  vfio/pds: Add support for firmware recovery
  vfio/pds: Add support for dirty page tracking
  vfio/pds: Add VFIO live migration support
  vfio/pds: register with the pds_core PF
  pds_core: Require callers of register/unregister to pass PF drvdata
  vfio/pds: Initial support for pds VFIO driver
  vfio: Commonize combine_ranges for use in other VFIO drivers
  kvm/vfio: avoid bouncing the mutex when adding and deleting groups
  kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add()
  docs: vfio: Add vfio device cdev description
  vfio: Compile vfio_group infrastructure optionally
  vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev()
  ...
2023-08-30 20:36:01 -07:00
Brett Creeley
9a4087fab3 vfio: Commonize combine_ranges for use in other VFIO drivers
Currently only Mellanox uses the combine_ranges function. The
new pds_vfio driver also needs this function. So, move it to
a common location for other vendor drivers to use.

Also, fix the RCT (reverse Christmas tree) ordering while
moving/renaming the function.

Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20230807205755.29579-2-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:52:23 -06:00
Maher Sanalla
f14c1a14e6 net/mlx5: Allocate completion EQs dynamically
This commit enables the dynamic allocation of EQs at runtime, allowing
for more flexibility in managing completion EQs and reducing the memory
overhead of driver load. Whenever a CQ is created for a given vector
index, the driver looks up whether a completion EQ is already mapped for
that vector and, if so, uses it. Otherwise, it allocates a new EQ on
demand and then uses it for the CQ completion events.

Add a protection lock to the EQ table to protect from concurrent EQ
creation attempts.

While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with
mlx5_comp_eqn_get() and mlx5_comp_irqn_get(), which will allocate an
EQ on demand if no EQ is found for the given vector.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07 10:53:52 -07:00
Maher Sanalla
674dd4e2e0 net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max()
To accurately represent its purpose, rename the function that retrieves
the value of maximum vectors from mlx5_comp_vectors_count() to
mlx5_comp_vectors_max().

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07 10:53:51 -07:00
Yishai Hadas
ce06a7000f vfio/mlx5: Fix range size calculation upon tracker creation
Fix the range size calculation to include the last byte of each range.

In addition, round up the log of the total ranges length to be stricter.

Fixes: c1d050b0d1 ("vfio/mlx5: Create and destroy page tracker object")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230208152234.32370-1-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-02-09 11:43:06 -07:00
Yishai Hadas
b04e2e86e9 vfio/mlx5: Improve the source side flow upon pre_copy
Improve the source side flow upon pre_copy as described below.

- Prepare the stop_copy buffers as part of moving to pre_copy.
- Send to the target a record that includes the expected
  stop_copy size to let it optimize its stop_copy flow as well.

To send the target this new record type (i.e.
MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE), we split the current 64 header
flag bits into 32 flag bits and another 32 tag bits; each record may
have a tag and a flag marking whether it's optional or mandatory.
Optional records will be ignored in the target (see the sketch after
this entry).

The above reduces the downtime upon stop_copy, as the relevant data is
prepared ahead of time as part of pre_copy.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230124144955.139901-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-01-30 12:16:15 -07:00
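
A hedged sketch of the 32/32 header split described above; only the
split itself and the MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE tag come from
the commit text, the field names and layout are illustrative:

  struct mlx5_vf_migration_header {
          __le64 record_size;     /* size of the record payload below */
          __le32 flags;           /* e.g. a bit marking the record optional */
          __le32 tag;             /* e.g. MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE */
          __u8   data[];          /* record payload, record_size bytes */
  };
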
Shay Drory
caf094b5a1 vfio/mlx5: Check whether VF is migratable
Add a check of whether the VF is migratable. Only if the VF is
migratable, mark the VFIO device as migration capable.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20230124144955.139901-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-01-30 12:16:15 -07:00
Yishai Hadas
83ff6095ec vfio/mlx5: Allow loading of larger images than 512 MB
Allow loading of images larger than 512 MB by dropping the arbitrary
hard-coded value that we have today and moving to the device's max
loading value, which is for now 4GB.

As part of that we move to the GFP_KERNEL_ACCOUNT option when
allocating the persistent data of mlx5, and rely on the cgroup to
provide the memory limit for the given user (see the sketch after this
entry).

The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this
is an untrusted allocation triggered from userspace and should be
subject to kmem accounting, and as such it is controlled by the cgroup
mechanism.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-01-23 11:26:29 -07:00
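
A minimal sketch of the allocation policy described above, assuming a
hypothetical helper; GFP_KERNEL_ACCOUNT is the point, the rest is
illustrative:

  static struct page **alloc_migration_page_list(size_t npages)
  {
          /*
           * Userspace-triggered, potentially large allocation: mark it with
           * GFP_KERNEL_ACCOUNT so it is charged to the caller's memory cgroup
           * and subject to its kmem limit.
           */
          return kvcalloc(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
  }
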
Yishai Hadas
c9c4c070e0 vfio/mlx5: Fix UBSAN note
Prevent calling roundup_pow_of_two() with a value of 0, as it causes
the UBSAN note below.

Move this code and its few related lines so that they are called only
when really applicable (see the sketch after this entry).

UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 15 PID: 1639 Comm: live_migration Not tainted 6.1.0-rc4 #1116
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
dump_stack_lvl+0x45/0x59
ubsan_epilogue+0x5/0x36
 __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
? lock_is_held_type+0x98/0x110
? rcu_read_lock_sched_held+0x3f/0x70
mlx5vf_create_rc_qp.cold+0xe4/0xf2 [mlx5_vfio_pci]
mlx5vf_start_page_tracker+0x769/0xcd0 [mlx5_vfio_pci]
 vfio_device_fops_unl_ioctl+0x63f/0x700 [vfio]
__x64_sys_ioctl+0x433/0x9a0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
 </TASK>

Fixes: 79c3cf2799 ("vfio/mlx5: Init QP based resources for dirty tracking")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230108154427.32609-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-01-23 11:26:29 -07:00
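
A hedged sketch of the guard implied by the fix, assuming the request
size can legitimately be zero; the helper name is illustrative:

  static size_t rounded_npages(size_t npages)
  {
          /*
           * roundup_pow_of_two(0) shifts by the full word width, which UBSAN
           * reports as shift-out-of-bounds; skip the rounding entirely for 0.
           */
          if (!npages)
                  return 0;
          return roundup_pow_of_two(npages);
  }
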
Shay Drory
ccc2a52e46 vfio/mlx5: Enable MIGRATION_PRE_COPY flag
Now that everything has been set up for MIGRATION_PRE_COPY, enable it.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-15-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Shay Drory
d6e18a4bec vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error
Before a SAVE command is issued, a QUERY command is issued in order to
know the device data size.
In case PRE_COPY is used, the above commands are issued while the device
is running. Thus, it is possible that between the QUERY and the SAVE
commands the state of the device changes significantly, causing the SAVE
to fail.

Currently, if a SAVE command fails, the driver fails the
migration. In the above case, don't fail the migration, but don't allow
new SAVEs to be executed while the device is in the RUNNING state.
Once the device is moved to STOP_COPY, SAVE can be executed again
and the full device state will be read.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-14-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Yishai Hadas
34e2f27143 vfio/mlx5: Introduce multiple loads
In order to support PRE_COPY, the mlx5 driver transfers multiple states
(images) of the device, e.g. the source VF can save and transfer
multiple states, and the target VF will load them in that order.

This patch implements the changes for the target VF to decompose the
header of each state and to write and load multiple states.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-13-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Yishai Hadas
81156c2727 vfio/mlx5: Consider temporary end of stream as part of PRE_COPY
During PRE_COPY the migration data FD may have a temporary "end of
stream" that is reached when the initial_bytes were read and no other
dirty data exists yet.

For instance, this may indicate that the device is idle and not
currently dirtying any internal state. When read() is done on this
temporary end of stream the kernel driver should return ENOMSG from
read(). Userspace can wait for more data or consider moving to
STOP_COPY.

To avoid blocking the user upon read() and let it get ENOMSG, we add a
new state named MLX5_MIGF_STATE_PRE_COPY on the migration file (see the
sketch after this entry).

In addition, we add the MLX5_MIGF_STATE_SAVE_LAST state to block the
read() once we call the last SAVE upon moving to STOP_COPY.

Any further error will be marked with MLX5_MIGF_STATE_ERROR and the user
won't be blocked.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-12-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
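
A hedged sketch of the read()-side decision described above; the
migration file state names come from this series, while the helper and
its policy for the non-PRE_COPY case are illustrative:

  static ssize_t migf_check_readable(enum mlx5_vf_migf_state state,
                                     bool has_data)
  {
          if (has_data)
                  return 0;       /* caller copies the available data */
          /*
           * Temporary end of stream while still in PRE_COPY: tell userspace
           * there is currently nothing more to read instead of blocking.
           */
          if (state == MLX5_MIGF_STATE_PRE_COPY)
                  return -ENOMSG;
          return -EAGAIN;         /* or block, depending on the file flags */
  }
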
Yishai Hadas
0dce165b1a vfio/mlx5: Introduce vfio precopy ioctl implementation
The vfio precopy ioctl returns an estimation of the data available for
transfer from the device (see the usage sketch after this entry).

Whenever a user is using VFIO_MIG_GET_PRECOPY_INFO, track the current
state of the device, and if needed, append the dirty data to the
transfer FD data. This is done by saving a middle state.

As mlx5 runs the SAVE command asynchronously, make sure to query for
incremental data only once there is no active save command.
Running both in parallel might end up with a failure of the incremental
query command on an un-tracked vhca.

Also, a middle state will be saved only after the previous state has
finished its SAVE command and has been fully transferred; this prevents
endless use of resources.

Co-developed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-11-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
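
A hedged userspace-side sketch of querying this estimation, assuming
'data_fd' is the migration data FD obtained when entering PRE_COPY; the
struct and ioctl come from the VFIO uAPI headers:

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static void print_precopy_estimation(int data_fd)
  {
          struct vfio_precopy_info info = { .argsz = sizeof(info) };

          if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info))
                  return;
          /* Estimation of data currently available from the device. */
          printf("initial_bytes=%llu dirty_bytes=%llu\n",
                 (unsigned long long)info.initial_bytes,
                 (unsigned long long)info.dirty_bytes);
  }
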
Yishai Hadas
0c9a38fee8 vfio/mlx5: Introduce SW headers for migration states
As mentioned in the previous patches, mlx5 transfers multiple states
when the PRE_COPY protocol is used. This mechanism requires the target
VM to know each state's size in order to execute multiple loads.
Therefore, add a SW header, with the needed information, to each saved
state the source VM transfers to the target VM.

This patch implements the source VM handling of the headers; the
following patch will implement the target VM handling of the headers.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Yishai Hadas
3319d287f4 vfio/mlx5: Introduce device transitions of PRE_COPY
In order to support PRE_COPY, the mlx5 driver transfers multiple
states (images) of the device, e.g. the source VF can save and transfer
multiple states, and the target VF will load them in that order.

The device is saving three kinds of states:
1) Initial state - when the device moves to PRE_COPY state.
2) Middle state - during PRE_COPY phase via VFIO_MIG_GET_PRECOPY_INFO.
   There can be multiple states of this type.
3) Final state - when the device moves to STOP_COPY state.

After moving to the PRE_COPY state, the user holds the saving migf FD
and can use it; for example, the user can start transferring data via
the read() callback. The user can also switch from PRE_COPY to STOP_COPY
whenever they see fit, which will invoke saving of the final state.

This means that the mlx5 VFIO device can be switched to STOP_COPY
without transferring any data in the PRE_COPY state. Therefore, when the
device moves to STOP_COPY, mlx5 stores the final state in a dedicated
queue entry on the list.

Co-developed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Yishai Hadas
c668878381 vfio/mlx5: Refactor to use queue based data chunks
Refactor to use queue based data chunks on the migration file.

The SAVE command adds a chunk to the tail of the queue, while the
read() API finds the required chunk and returns its data (see the sketch
after this entry).

If the queue is empty but the state of the migration file is
MLX5_MIGF_STATE_COMPLETE, read() does not block but returns 0 to
indicate end of file.

This is a step towards maintaining multiple images and their metadata
(i.e. headers) on the migration file as part of the next patches in the
series.

Note:
At this point we still use a single chunk on the migration file, but the
code becomes ready to support multiple.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
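
A hedged sketch of the queue behaviour described above; the list and
struct names are illustrative, only the MLX5_MIGF_STATE_COMPLETE
end-of-file semantics come from the commit text:

  struct data_chunk {
          struct list_head list;
          size_t length;
  };

  /* Returns the chunk to copy from, NULL for EOF, or an ERR_PTR to wait. */
  static struct data_chunk *next_chunk(struct list_head *ready_list,
                                       enum mlx5_vf_migf_state state)
  {
          if (!list_empty(ready_list))
                  return list_first_entry(ready_list, struct data_chunk, list);
          if (state == MLX5_MIGF_STATE_COMPLETE)
                  return NULL;            /* read() returns 0: end of file */
          return ERR_PTR(-EAGAIN);        /* caller blocks/waits for data */
  }
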
Yishai Hadas
8b599d1434 vfio/mlx5: Refactor migration file state
Refactor the migration file state to be an enum whose values are
mutually exclusive (see the sketch after this entry).

As part of that, drop the 'disabled' state, as 'error' is the same from
a functional point of view.

The next patches in the series will extend this enum with other relevant
states.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
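
A hedged sketch of such an enum, using state names that appear elsewhere
in this series; the exact set and order in the driver may differ:

  enum mlx5_vf_migf_state {
          MLX5_MIGF_STATE_ERROR = 1,
          MLX5_MIGF_STATE_PRE_COPY,       /* added later in the series */
          MLX5_MIGF_STATE_SAVE_LAST,      /* added later in the series */
          MLX5_MIGF_STATE_COMPLETE,
  };
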
Yishai Hadas
91454f8b9b vfio/mlx5: Refactor MKEY usage
This patch refactors the MKEY usage so that its life cycle matches that
of the migration file, instead of allocating/destroying it upon each
SAVE/LOAD command.

This is a preparation step towards the PRE_COPY series where multiple
images will be SAVED/LOADED.

We achieve this with a new struct named mlx5_vhca_data_buffer, which
holds the mkey and its related data, such as the sg_append_table,
allocated_length, etc.

The above fields were moved out of the migration file's main struct into
the dedicated mlx5_vhca_data_buffer struct, with the proper helpers in
place.

For now we have a single mlx5_vhca_data_buffer per migration file.
However, in coming patches we'll have multiple of them to support
multiple images.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Yishai Hadas
9945a67ea4 vfio/mlx5: Refactor PD usage
This patch refactors the PD usage so that its life cycle matches that
of the migration file, instead of allocating/destroying it upon each
SAVE/LOAD command.

This is a preparation step towards the PRE_COPY series where multiple
images will be SAVED/LOADED and a single PD can be simply reused.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
Yishai Hadas
0e7caa65d7 vfio/mlx5: Enforce a single SAVE command at a time
Enforce a single SAVE command at a time.

As the SAVE command is an asynchronous one, we must enforce running only
a single command at a time.

This will preserve ordering between multiple calls and protect from
races on the migration file data structure.

This is a must for the next patches in the series, where as part of
PRE_COPY we may have multiple images to save and multiple SAVE commands
may be issued from different flows (see the sketch after this entry).

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20221206083438.37807-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-12-06 12:36:44 -07:00
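
A hedged sketch of serializing SAVE submissions; the lock and flag are
illustrative, not the driver's actual synchronization primitives:

  struct save_serializer {
          struct mutex lock;
          bool save_in_flight;
  };

  static int try_start_save(struct save_serializer *s)
  {
          int ret = 0;

          mutex_lock(&s->lock);
          if (s->save_in_flight)
                  ret = -EBUSY;   /* a SAVE is already running; keep ordering */
          else
                  s->save_in_flight = true;       /* cleared on completion */
          mutex_unlock(&s->lock);
          return ret;
  }
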
Yishai Hadas
2f5d8cef45 vfio/mlx5: Fix a typo in mlx5vf_cmd_load_vhca_state()
Fix a typo in mlx5vf_cmd_load_vhca_state() to use the 'load' memory
layout.

As the in/out sizes are equal for the save and load commands, there
wasn't any functional issue.

Fixes: f1d98f346e ("vfio/mlx5: Expose migration commands over mlx5 device")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221106174630.25909-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-11-14 11:37:07 -07:00
Yishai Hadas
f39856aacb vfio/mlx5: Set the driver DMA logging callbacks
Now that everything is ready, set the driver DMA logging callbacks if
supported by the device.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-11-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-09-08 12:59:01 -06:00
Yishai Hadas
e295738756 vfio/mlx5: Manage error scenarios on tracker
Handle async error events and health/recovery flow to safely stop the
tracker upon error scenarios.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-09-08 12:59:01 -06:00
Yishai Hadas
1047797e8e vfio/mlx5: Report dirty pages from tracker
Report dirty pages from tracker.

It includes:
Querying for dirty pages in a given IOVA range; this is done by
moving the tracker into the reporting state and supplying the
required range.

Using the CQ event completion mechanism to be notified once data is
ready on the CQ/QP to be processed.

Once data is available, turn on the corresponding bits in the bitmap.

This functionality will be used as part of the 'log_read_and_clear'
driver callback in the next patches.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-09-08 12:59:01 -06:00
Yishai Hadas
c1d050b0d1 vfio/mlx5: Create and destroy page tracker object
Add support for creating and destroying page tracker object.

This object is used to control/report the device dirty pages.

As part of creating the tracker, the device capabilities for max ranges
need to be considered, and ranges adapted/combined accordingly.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-09-08 12:59:01 -06:00
Yishai Hadas
79c3cf2799 vfio/mlx5: Init QP based resources for dirty tracking
Initialize QP-based resources for dirty tracking, to be used when
logging is started.

It includes:
Creating the host and firmware RC QPs and moving each of them to its
expected state based on the device specification, etc.

Creating the relevant resources needed by both QPs, such as the UAR,
PD, etc.

Creating the host receive-side resources, such as the MKEY, CQ, receive
WQEs, etc.

The above resources are cleaned up when logging is stopped.

The tracker object that will be introduced by the next patches will use
those resources.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220908183448.195262-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-09-08 12:59:00 -06:00
Yishai Hadas
6e97eba8ad vfio: Split migration ops from main device ops
vfio core checks whether the driver sets some migration op (e.g.
set_state/get_state) and accordingly calls its op.

However, the mlx5 driver currently sets the above ops without regard to
its migration caps.

This might lead to unexpected usage or an Oops if user space calls the
above ops even though the driver doesn't support migration. For example,
the migration state_mutex is not initialized in that case.

The cleanest way to manage that seems to be splitting the migration ops
from the main device ops; this lets the driver set them separately from
the main ops when applicable (see the sketch after this entry).

As part of that, validate ops construction on registration and include a
check for VFIO_MIGRATION_STOP_COPY since the uAPI claims it must be set
in migration_flags.

HISI driver was changed as well to match this scheme.

Down the road, this scheme may enable adding extra groups of ops (e.g.
DMA log) that can be set independently of the other options, based on
driver caps.

Fixes: 6fadb02126 ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220628155910.171454-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-06-30 10:47:22 -06:00
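
A hedged sketch of the split described above; struct vfio_migration_ops
and the mig_ops pointer exist in the vfio core after this change, while
the mlx5 wiring shown here is illustrative:

  static const struct vfio_migration_ops mlx5vf_pci_mig_ops = {
          .migration_set_state = mlx5vf_pci_set_device_state,
          .migration_get_state = mlx5vf_pci_get_device_state,
  };

  /* Set the migration ops only when the device is actually migratable. */
  static void mlx5vf_set_mig_ops(struct mlx5vf_pci_core_device *mvdev)
  {
          if (!mvdev->migrate_cap)
                  return;
          mvdev->core_device.vdev.migration_flags = VFIO_MIGRATION_STOP_COPY;
          mvdev->core_device.vdev.mig_ops = &mlx5vf_pci_mig_ops;
  }
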
Yishai Hadas
2b1c190628 vfio/mlx5: Protect mlx5vf_disable_fds() upon close device
Protect mlx5vf_disable_fds() upon close device by calling it under the
state mutex, as done in all other places.

This prevents a race with any other flow that calls
mlx5vf_disable_fds(), such as health/recovery upon the
MLX5_PF_NOTIFY_DISABLE_VF event.

Encapsulate this functionality in a separate function named
mlx5vf_cmd_close_migratable() that considers the migration caps, for
further use upon close device (see the sketch after this entry).

Fixes: 6fadb02126 ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20220628155910.171454-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-06-30 10:45:39 -06:00
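
A hedged sketch based on the function named in this commit; the field
names are illustrative and the real driver's deferred-reset handling of
the state mutex is omitted:

  void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev)
  {
          if (!mvdev->migrate_cap)
                  return;

          /* Same state mutex as every other flow that touches the FDs. */
          mutex_lock(&mvdev->state_mutex);
          mlx5vf_disable_fds(mvdev);
          mutex_unlock(&mvdev->state_mutex);
  }
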
Yishai Hadas
85c205db60 vfio/mlx5: Run the SAVE state command in an async mode
Use the PF asynchronous command mode for the SAVE state command.

This enables returning to user space earlier, once the command has been
issued successfully, and improves latency by letting things run in
parallel (see the sketch after this entry).

Link: https://lore.kernel.org/r/20220510090206.90374-5-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-05-11 09:33:40 +03:00
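
A hedged sketch of issuing the SAVE through the mlx5 async command API;
mlx5_cmd_exec_cb() and its types are existing mlx5_core interfaces,
while the wrapper itself is illustrative:

  static int issue_save_async(struct mlx5_async_ctx *async_ctx,
                              void *in, int in_size, void *out, int out_size,
                              mlx5_async_cbk_t done,
                              struct mlx5_async_work *work)
  {
          /* Returns once the command is queued; 'done' runs on completion. */
          return mlx5_cmd_exec_cb(async_ctx, in, in_size, out, out_size,
                                  done, work);
  }
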
Yishai Hadas
8580ad14f9 vfio/mlx5: Refactor to enable VFs migration in parallel
Refactor to enable different VFs to run their commands over the PF
command interface in parallel and not block one another.

This is done by dropping the global PF lock that was used before and
relying on the VF attach/detach mechanism to sync.

Link: https://lore.kernel.org/r/20220510090206.90374-4-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-05-11 09:33:33 +03:00
Yishai Hadas
61a2f1460f vfio/mlx5: Manage the VF attach/detach callback from the PF
Manage the VF attach/detach callback from the PF.

This lets the driver enable parallel VF migration, as will be
introduced in the next patch.

As part of this, move the 'VF is migratable' code into a separate
function and rename it to set_migratable() to match its functionality.

Link: https://lore.kernel.org/r/20220510090206.90374-3-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-05-11 09:33:22 +03:00
Yishai Hadas
f1d98f346e vfio/mlx5: Expose migration commands over mlx5 device
Expose migration commands over the device; this includes: suspend,
resume, get vhca id, and query/save/load state.

As part of this, add the APIs and data structures needed to manage the
migration data.

Link: https://lore.kernel.org/all/20220224142024.147653-13-yishaih@nvidia.com
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2022-03-03 13:01:19 +02:00