linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Matthew Brost	0417a5f848	drm/xe: Always capture exec queues on snapshot Always capture exec queues on snapshot regardless if exec queue has pending jobs or not. Having jobs or not does indicate whether the exec queue capture is useful. Example bugs that would not be easily detected by skipping capture when pending job list is empty: - Jobs pending on exec queue have dependencies - Leaking exec queue refs - GuC protocol issues (i.e. losing G2H) In addition to above bugs, in general it just useful to see every exec queue registered with the GuC and its state. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240405211632.223568-2-matthew.brost@intel.com	2024-04-08 14:47:37 -07:00
Michal Wajdeczko	83787afe06	drm/xe/guc: Initialize GuC ID manager sooner The GuC submission cleanup code may depend on the GuC ID manager, thus we can't initialize it after registering a submission cleanup action, as reverse cleanup sequence will destroy GuC ID manager prior to a call to guc_submit_fini(). Move GuC ID manager initialization up, right after managed mutex initialization, to have it available during guc_submit_fini(). Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240406143946.979-2-michal.wajdeczko@intel.com	2024-04-08 11:22:18 +02:00
Michal Wajdeczko	104f7519db	drm/xe/guc: Use drm_device-managed version of mutex_init() This is safer approach and will help resolve a cleanup ordering conflict related to the GuC ID manager. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240406143946.979-1-michal.wajdeczko@intel.com	2024-04-08 11:22:17 +02:00
Rodrigo Vivi	972d01d0e3	drm/xe: Protect devcoredump access after unbind While we don't have the full flow protection when devcoredump is accessed after device unbind. Let's at least for now protect against null dereference: [ 422.766508] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] [ 423.119584] RIP: 0010:xe_vm_snapshot_free+0x30/0x180 [xe] While at it, I also fixed a non-standard code-declaration block on the similar function of xe_guc_submit. v2: - Use IS_ERR_OR_NULL (Nirmoy) - Expand to other functions Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240403195044.239766-1-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2024-04-04 14:53:22 -04:00
Michal Wajdeczko	e6e7eff627	drm/xe/guc: Use GuC ID Manager in submission code We are ready to replace private guc_ids management code with separate GuC ID Manager that can be shared with upcoming SR-IOV PF provisioning code. Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240313221112.1089-5-michal.wajdeczko@intel.com	2024-03-27 20:19:29 +01:00
Michal Wajdeczko	f88beeed82	drm/xe/guc: Move GUC_ID_MAX definition to GuC ABI header This macro represents GuC firmware capability and shall be defined in the firmware ABI header. Move it to xe_guc_fwif.h file. Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240313221112.1089-2-michal.wajdeczko@intel.com	2024-03-27 20:19:23 +01:00
Daniele Ceraolo Spurio	4c15a6dcee	drm/xe/uc: Use u64 for offsets for which we use upper_32_bits() The GGTT is currently a 32 bit address space, but the HW and GuC support 48b addresses in GGTT-related operations, both to keep the interface/HW paths common between PPGTT and GGTT and to allow for future increase of the GGTT size. This leaves us having to program a 64b field with a 32b offset, which currently we're in some cases doing this by using an upper_32_bits() call on a 32b variable, which doesn't make any sense. To do this cleanly we have 2 options: 1 - Set the upper 32 bits directly to zero. 2 - Use 64b variables for the offset and keep programming the whole thing, so we're ready if we ever have bigger offsets. This patch goes with option #2 and switches the related variables to u64. v2: don't change the log ctl flag variable (John) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240319195101.2784480-1-daniele.ceraolospurio@intel.com	2024-03-20 14:40:57 -07:00
Daniele Ceraolo Spurio	649a125a88	drm/xe: Always check force_wake_get return code A force_wake_get failure means that the HW might not be awake for the access we're doing; this can lead to an immediate error or it can be a more subtle problem (e.g. a register read might return an incorrect value that is still valid, leading the driver to make a wrong choice instead of flagging an error). We avoid an error from the force_wake function because callers might handle or tolerate the error, but this only works if all callers are checking the error code. The majority already do, but a few are not. These are mainly falling into 3 categories, which are each handled differently: 1) error capture: in this case we want to continue the capture, but we log an info message in dmesg to notify the user that the capture might have incorrect data. 2) ioctl: in this case we return a -EIO error to userspace 3) unabortable actions: these are scenarios where we can't simply abort and retry and so it's better to just try it anyway because there is a chance the HW is awake even with the failure. In this case we throw a warning so we know there was a forcewake problem if something fails down the line. v2: use gt_WARN_ON where appropriate Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240318154924.3453513-1-daniele.ceraolospurio@intel.com	2024-03-20 14:13:58 -07:00
Niranjana Vishwanathapura	aacf3f629a	drm/xe: Separate out sched/deregister_done handling Abstract out the core part of sched_done and deregister_done handlers to separate functions to decouple them from any protocol error handling part and make them more readable. Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240319184153.16667-1-niranjana.vishwanathapura@intel.com	2024-03-19 22:36:15 -07:00
Matthew Auld	2c5b70f74d	drm/xe/guc_submit: use jiffies for job timeout drm_sched_init() expects jiffies for the timeout, but here we are passing the timeout in ms. Convert to jiffies instead. Fixes: `eef55700f3` ("drm/xe: Add sysfs for default engine scheduler properties") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240314121554.223229-2-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-03-14 10:11:34 -07:00
Maarten Lankhorst	784b34100f	drm/xe: Add infrastructure for delayed LRC capture Add a xe_guc_exec_queue_snapshot_capture_delayed and xe_lrc_snapshot_capture_delayed function to capture the contents of LRC in the next patch. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240227131248.92910-2-maarten.lankhorst@linux.intel.com	2024-03-04 12:24:38 +01:00
Maarten Lankhorst	47058633d9	drm/xe: Move lrc snapshot capturing to xe_lrc.c This allows the dumping of HWSP and HW Context without exporting more functions. Changes since v1: - GFP_KERNEL -> GFP_NOWAIT. (Souza) Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240227131248.92910-1-maarten.lankhorst@linux.intel.com	2024-03-04 12:23:12 +01:00
Matthew Brost	e275d61c5f	drm/xe/guc: Handle timing out of signaled jobs gracefully Timing out of signaled jobs can happen during regular operations (e.g. an exec queue closed immediately after last fence signaled). The TDR can pass the worker which free jobs. Rather than running through the TDR if signaled job is found, simply free it without any debug messages. Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reported-by: José Roberto de Souza <jose.souza@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1271 Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Tested-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223204659.40750-1-matthew.brost@intel.com	2024-02-26 13:07:18 -08:00
Maarten Lankhorst	8491b0ef32	drm/xe/snapshot: Remove drm_err on guc alloc failures The kernel will complain loudly if allocation fails, no need to do it ourselves. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240221133024.898315-1-maarten.lankhorst@linux.intel.com	2024-02-21 20:08:19 +01:00
Christophe JAILLET	69a5f1774a	drm/xe/guc: Remove usage of the deprecated ida_simple_xx() API ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_max() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/d6a9ec9dc426fca372eaa1423a83632bd743c5d9.1705244938.git.christophe.jaillet@wanadoo.fr	2024-02-20 12:38:43 -08:00
Lucas De Marchi	fbb944086f	Merge drm/drm-next into drm-xe-next Bring changes from drm-misc-next that got merged in drm-next back to drm-xe so they can be used for additional features. Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>	2024-02-20 09:57:17 -08:00
Thomas Hellström	f1a9abc0cf	drm/xe/uapi: Remove support for persistent exec_queues Persistent exec_queues delays explicit destruction of exec_queues until they are done executing, but destruction on process exit is still immediate. It turns out no UMD is relying on this functionality, so remove it. If there turns out to be a use-case in the future, let's re-add. Persistent exec_queues were never used for LR VMs v2: - Don't add an "UNUSED" define for the missing property (Lucas, Rodrigo) v3: - Remove the remaining struct xe_exec_queue::persistent state (Niranjana, Lucas) Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240209113444.8396-1-thomas.hellstrom@linux.intel.com	2024-02-19 12:54:48 +01:00
Jani Nikula	98459fb5ab	drm/xe: fix arguments to drm_err_printer() The commit below changed drm_err_printer() arguments, but failed to update all places. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/r/20240213120410.75c45763@canb.auug.org.au Fixes: `5e0c04c8c4` ("drm/print: make drm_err_printer() device specific by using drm_err()") Cc: Luca Coelho <luciano.coelho@intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240213084954.878643-1-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>	2024-02-13 12:02:08 +02:00
Matt Roper	996da37ffa	drm/xe: Convert job timeouts from assert to warning xe_assert() is intended to be used only for "impossible" situations that should never be hit (and if they are hit it means there's a driver bug somewhere); assertions are only compiled into debug builds. Although we expect jobs submitted by the kernel to be well-behaved and run without error, timeouts are a legitimate possibility for reasons beyond our control (bad firmware, flaky hardware, etc.). We should use a real WARN if we encounter these, even for non-debug builds, to ensure the issue is being properly highlighted in bug reports and such. Also give the WARNs more human-readable messages and move them below the general notice-level message that gets printed for any kind of timeout to make the errors a bit more understandable. v2: - Convert the VM / exec_queue_killed assertion as well. (MattB) Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240130200308.1429134-2-matthew.d.roper@intel.com	2024-01-31 08:09:30 -08:00
José Roberto de Souza	98fefec8c3	drm/xe: Change devcoredump functions parameters to xe_sched_job When devcoredump start to dump the VMs contents it will be necessary to know the starting addresses of batch buffers of the job that hang. This information it set in xe_sched_job and xe_sched_job is not easily acessible from xe_exec_queue, so here changing the parameter, next patch will append the batch buffer addresses to devcoredump snapshot capture. v3: - update functions documentation to xe_sched_job Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240123204454.246788-2-jose.souza@intel.com	2024-01-24 10:53:38 -08:00
Brian Welty	19c0222524	drm/xe: Fix modifying exec_queue priority in xe_migrate_init After exec_queue has been created, we cannot simply modify q->priority. This needs to be done by the backend via q->ops. However in this case, it would be more efficient to simply pass a flag when creating the exec_queue and set the desired priority upfront during queue creation. To that end: new flag EXEC_QUEUE_FLAG_HIGH_PRIORITY is introduced. The priority field is moved to be with other scheduling properties and is now exec_queue.sched_props.priority. This is no longer set to initial value by the backend, but is now set within __xe_exec_queue_create(). Fixes: `b4eecedc75` ("drm/xe: Fix potential deadlock handling page faults") Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> (cherry picked from commit `a8004af338`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>	2024-01-15 15:36:50 +01:00
Brian Welty	fef257eb6d	drm/xe: Fix guc_exec_queue_set_priority We need to set q->priority prior to calling guc_exec_queue_add_msg() as that will call init_policies() and sets the scheduling properties to those stored in the exec_queue. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> (cherry picked from commit `b16483f9f8`) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>	2024-01-15 15:36:48 +01:00
Brian Welty	801e8c7ed6	drm/xe: Remove set_job_timeout_ms() from exec_queue_ops This function is no longer used as the job_timeout is now updated prior to calling queue_ops.init(). Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2024-01-10 15:01:57 -08:00
Brian Welty	6ae24344e2	drm/xe: Add exec_queue.sched_props.job_timeout_ms The purpose here is to allow to optimize exec_queue_set_job_timeout() in follow-on patch. Currently it does q->ops->set_job_timeout(...). But we'd like to apply exec_queue_user_extensions much earlier and q->ops cannot be called before __xe_exec_queue_init(). It will be much more efficient to instead only have to set q->sched_props.job_timeout_ms when applying user extensions. That value will then be used during q->ops->init(). Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2024-01-10 15:01:48 -08:00
Brian Welty	a8004af338	drm/xe: Fix modifying exec_queue priority in xe_migrate_init After exec_queue has been created, we cannot simply modify q->priority. This needs to be done by the backend via q->ops. However in this case, it would be more efficient to simply pass a flag when creating the exec_queue and set the desired priority upfront during queue creation. To that end: new flag EXEC_QUEUE_FLAG_HIGH_PRIORITY is introduced. The priority field is moved to be with other scheduling properties and is now exec_queue.sched_props.priority. This is no longer set to initial value by the backend, but is now set within __xe_exec_queue_create(). Fixes: `b4eecedc75` ("drm/xe: Fix potential deadlock handling page faults") Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2024-01-09 14:11:58 -08:00
Brian Welty	b16483f9f8	drm/xe: Fix guc_exec_queue_set_priority We need to set q->priority prior to calling guc_exec_queue_add_msg() as that will call init_policies() and sets the scheduling properties to those stored in the exec_queue. Fixes: `dd08ebf6c3` ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com>	2024-01-09 14:11:38 -08:00
Bommu Krishnaiah	e670f0b4ef	drm/xe/uapi: Return correct error code for xe_wait_user_fence_ioctl Currently xe_wait_user_fence_ioctl is not checking exec_queue state and blocking until timeout, with this patch wakeup the blocking wait if exec_queue reset happen and returning proper error code Signed-off-by: Bommu Krishnaiah <krishnaiah.bommu@intel.com> Cc: Oak Zeng <oak.zeng@intel.com> Cc: Kempczynski Zbigniew <Zbigniew.Kempczynski@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Acked-by: Mateusz Naklicki <mateusz.naklicki@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:46:20 -05:00
Michal Wajdeczko	b67cb798e4	drm/xe/guc: Include only required GuC ABI headers On i915 we were adding new GuC ABI headers directly to guc_fwif.h file since we were replacing old definitions from that file. On xe driver we could do more and better by including ABI headers only in files that need those definitions. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/741 Cc: Jani Nikula <jani.nikula@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20231128203203.1147-3-michal.wajdeczko@intel.com Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:45:08 -05:00
Thomas Hellström	fdb6a05383	drm/xe: Internally change the compute_mode and no_dma_fence mode naming The name "compute_mode" can be confusing since compute uses either this mode or fault_mode to achieve the long-running semantics, and compute_mode can, moving forward, enable fault_mode under the hood to work around hardware limitations. Also the name no_dma_fence_mode really refers to what we elsewhere call long-running mode and the mode contrary to what its name suggests allows dma-fences as in-fences. So in an attempt to be more consistent, rename no_dma_fence_mode -> lr_mode compute_mode -> preempt_fence_mode And adjust flags so that preempt_fence_mode sets XE_VM_FLAG_LR_MODE fault_mode sets XE_VM_FLAG_LR_MODE \| XE_VM_FLAG_FAULT_MODE v2: - Fix a typo in the commit message (Oak Zeng) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Oak Zeng <oak.zeng@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231127123349.23698-1-thomas.hellstrom@linux.intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:44:58 -05:00
Michal Wajdeczko	3d78923bd0	drm/xe/guc: Promote guc_to_gt/xe helpers to .h Duplicating these helpers in almost every .c file is a bad idea. Define them as inlines in .h file to allow proper reuse. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:44:32 -05:00
Matthew Brost	a839e365ac	drm/xe: Use pool of ordered wq for GuC submission To appease lockdep, use a pool of ordered wq for GuC submission rather tha leaving the ordered wq allocation to the drm sched. Without this change eventually lockdep runs out of hash entries (MAX_LOCKDEP_CHAINS is exceeded) as each user allocated exec queue adds more hash table entries to lockdep. A pool old of 256 ordered wq should be enough to have similar behavior with and without lockdep enabled. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:43:39 -05:00
Bommithi Sakeena	28b1d9155c	drm/xe: Ensure mutex are destroyed Add missing mutex_destroy calls to fini functions or convert to drmm_mutex_init where fini function is not available. Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Bommithi Sakeena <bommithi.sakeena@intel.com> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:41:21 -05:00
Daniele Ceraolo Spurio	cb90d46918	drm/xe: Add child contexts to the GuC context lookup The CAT_ERROR message from the GuC provides the guc id of the context that caused the problem, which can be a child context. We therefore need to be able to match that id to the exec_queue that owns it, which we do by adding child context to the context lookup. While at it, fix the error path of the guc id allocation code to correctly free the ids allocated for parallel queues. v2: rebase on s/XE_WARN_ON/xe_assert Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/590 Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:41:14 -05:00
Daniele Ceraolo Spurio	c4991ee01d	drm/xe/uc: Rename guc_submission_enabled() to uc_enabled() The guc_submission_enabled() function is being used as a boolean toggle for all firmwares and all related features, not just GuC submission. We could add additional flags/functions to distinguish and allow different use-cases (e.g. loading HuC but not using GuC submission), but given that not using GuC is a debug-only scenario having a global switch for all FWs is enough. However, we want to make it clear that this switch turns off everything, so rename it to uc_enabled(). v2: rebase on s/XE_WARN_ON/xe_assert Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:41:13 -05:00
Francois Dugast	c73acc1eeb	drm/xe: Use Xe assert macros instead of XE_WARN_ON macro The XE_WARN_ON macro maps to WARN_ON which is not justified in many cases where only a simple debug check is needed. Replace the use of the XE_WARN_ON macro with the new xe_assert macros which relies on drm_*. This takes a struct drm_device argument, which is one of the main changes in this commit. The other main change is that the condition is reversed, as with XE_WARN_ON a message is displayed if the condition is true, whereas with xe_assert it is if the condition is false. v2: - Rebase - Keep WARN splats in xe_wopcm.c (Matt Roper) v3: - Rebase Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:41:08 -05:00
Francois Dugast	5c0553cdc8	drm/xe: Replace XE_WARN_ON with drm_warn when just printing a string Use the generic drm_warn instead of the driver-specific XE_WARN_ON in cases where XE_WARN_ON is used to unconditionally print a debug message. v2: Rebase Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:41:07 -05:00
Daniele Ceraolo Spurio	923e423817	drm/xe: split kernel vs permanent engine flags If an engine is only destroyed on driver unload, we can skip its clean-up steps with the GuC because the GuC is going to be tuned off as well, so it doesn't matter if we're in sync with it or not. Currently, we apply this optimization to all engines marked as kernel, but this stops us to supporting kernel engines that don't stick around until unload. To remove this limitation, add a separate flag to indicate if the engine is expected to only be destryed on driver unload and use that to trigger the optimzation. While at it, add a small comment to explain what each engine flag represents. v2: s/XE_BUG_ON/XE_WARN_ON, s/ENGINE/EXEC_QUEUE v3: rebased Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20230822173334.1664332-3-daniele.ceraolospurio@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:40:27 -05:00
Daniele Ceraolo Spurio	1c66c0f391	drm/xe: fix submissions without vm Kernel queues can submit privileged batches directly in GGTT, so they don't always need a vm. The submission front-end already supports creating and submitting jobs without a vm, but some parts of the back-end assume the vm is always there. Fix this by handling a lack of vm in the back-end as well. v2: s/XE_BUG_ON/XE_WARN_ON, s/engine/exec_queue Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20230822173334.1664332-2-daniele.ceraolospurio@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:40:27 -05:00
Daniele Ceraolo Spurio	0b1d1473b3	drm/xe: common function to assign queue name The queue name assignment is identical in both GuC and execlists backends, so we can move it to a common function. This will make adding a new entry in the next patch slightly cleaner. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20230817201831.1583172-2-daniele.ceraolospurio@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:40:20 -05:00
Matthew Brost	a20c75dba1	drm/xe: Call __guc_exec_queue_fini_async direct for KERNEL exec_queues Usually we call __guc_exec_queue_fini_async via a worker as the exec_queue fini can be done from within the GPU scheduler which creates a circular dependency without a worker. Kernel exec_queues are fini'd at driver unload (not from within the GPU scheduler) so it is safe to directly call __guc_exec_queue_fini_async. Suggested-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:40:19 -05:00
Matthew Auld	ef6ea97228	drm/xe/guc_submit: fixup deregister in job timeout Rather check if the engine is still registered before proceeding with deregister steps. Also the engine being marked as disabled doesn't mean the engine has been disabled or deregistered from GuC pov, and here we are signalling fences so we need to be sure GuC is not still using this context. v2: - Drop the read_stopped() for this path. Since we are signalling fences on error here, best play it safe and wait for the GT reset to mark the engine as disabled, rather than it just being queued. v3 (Matt Brost): - Keep the read_stopped() on the wait event, since there is no need to wait for an already scheduled GT reset. If it is set we can then just bail without signalling anything. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:40:19 -05:00
Matthew Auld	17d28aa8bd	drm/xe: don't warn for bogus pagefaults This appears to be easily user triggerable so warning is perhaps too much. Rather just make it debug print. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/534 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:40:19 -05:00
Matthew Auld	31b57683de	drm/xe/guc_submit: prevent repeated unregister It seems that various things can trigger the lr cleanup worker, including CAT error, engine reset and destroying the actual engine, so seems plausible to end up triggering the worker more than once in some cases. If that does happen we can race with an ongoing engine deregister before it has completed, thus triggering it again and also changing the state back into pending_disable. Checking if the engine has been marked as destroyed looks like it should prevent this. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:29 -05:00
Tejas Upadhyay	eef55700f3	drm/xe: Add sysfs for default engine scheduler properties For each HW engine under GT we are adding defaults sysfs entry to list all engine scheduler properties and its default values. So that it will be easier for user to fetch default values of these properties anytime to go back to default. For example, DUT# cat /sys/class/drm/card1/device/tileN/gtN/engines/bcs/.defaults/ job_timeout_ms preempt_timeout_us timeslice_duration_us where, @job_timeout_ms: The time after which a job is removed from the scheduler. @preempt_timeout_us: How long to wait (in microseconds) for a preemption event to occur when submitting a new context. @timeslice_duration_us: Each context is scheduled for execution for the timeslice duration, before switching to the next context. V12: - Add missing drmm_add_action_or_reset and remove sysfs files V11: - Rebase V10: - Remove xe_gt.h inclusion from .h - Matt V9 : - Remove jiffies for job_timeout_ms - Matt V8 : - replace xe_engine with xe_hw_engine - Matt V7 : - Push all errors to one error path at every places - Niranjana - Describe struct member to resolve kernel doc err - CI hooks V6 : - Use engine class interface instead of hw engine in sysfs for better interfacing readability - Niranjana V5 : - Scheduling props should apply per class engine not per hardware engine - Matt - Do not record value of job_timeout_ms if changed based on dma_fence - Matt V4 : - Resolve merge conflicts - CI V3 : - Rearrange code in its own file - Rebase - Update commit message to reflect tile addition V2 : - Use sysfs_create_files in this patch - Niranjana - Handle prototype error for xe_add_engine_defaults - CI hooks - Remove unused member sysfs_hwe - Niranjana Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:21 -05:00
Francois Dugast	9b9529ce37	drm/xe: Rename engine to exec_queue Engine was inappropriately used to refer to execution queues and it also created some confusion with hardware engines. Where it applies the exec_queue variable name is changed to q and comments are also updated. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/162 Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:20 -05:00
Francois Dugast	c22a4ed0c3	drm/xe: Rename xe_engine.[ch] to xe_exec_queue.[ch] This is a preparation commit for a larger renaming of engine to exec queue. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:17 -05:00
Francois Dugast	99fea68288	drm/xe: Prefer WARN() over BUG() to avoid crashing the kernel Replace calls to XE_BUG_ON() with calls XE_WARN_ON() which in turn calls WARN() instead of BUG(). BUG() crashes the kernel and should only be used when it is absolutely unavoidable in case of catastrophic and unrecoverable failures, which is not the case here. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:17 -05:00
Matthew Brost	e3828ebf6c	drm/xe: Add define WQ_HEADER_SIZE Previously used a a magic '+ 3', use define instead. Suggested-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:39:16 -05:00
Francois Dugast	3e8e7ee6a3	drm/xe: Cleanup style warnings Reduce the number of warnings reported by checkpatch.pl from 118 to 48 by addressing those warnings types: LEADING_SPACE LINE_SPACING BRACES TRAILING_SEMICOLON CONSTANT_COMPARISON BLOCK_COMMENT_STYLE RETURN_VOID ONE_SEMICOLON SUSPECT_CODE_INDENT LINE_CONTINUATIONS UNNECESSARY_ELSE UNSPECIFIED_INT UNNECESSARY_INT MISORDERED_TYPE Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:31 -05:00
Francois Dugast	763931d25c	drm/xe: Cleanup CODE_INDENT style issues Remove all existing style issues of type CODE_INDENT reported by checkpatch. Signed-off-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>	2023-12-21 11:37:30 -05:00

1 2

62 commits