Xe Device Coredump
Xe uses dev_coredump infrastructure for exposing the crash errors in a
standardized way. Once a crash occurs, devcoredump exposes a temporary
node under /sys/class/devcoredump/devcd<m>/. The same node is also
accessible in /sys/class/drm/card<n>/device/devcoredump/. The
failing_device symlink points to the device that crashed and created the
coredump.
The following characteristics are observed by xe when creating a device coredump:
- Snapshot at hang:
The ‘data’ file contains a snapshot of the HW and driver states at the time the hang happened. Because the driver recovers from resets and crashes, the snapshot may not correspond to the state of the system when the file is read by userspace.
- Coredump release:
After a coredump is generated, it stays in kernel memory until released by userspace by writing anything to it, or after an internal timer expires. The exact timeout may vary and should not be relied upon. Example to release a coredump:
$ > /sys/class/drm/card0/device/devcoredump/data
- First failure only:
In general, the first hang is the most critical one, since the following hangs can be a consequence of the initial hang. For this reason a snapshot is taken only for the first failure. Until the devcoredump is released by userspace or by the kernel, subsequent hangs neither overwrite the snapshot nor create new ones. Devcoredump has a delayed work queue that will eventually delete the file node and free all the dump information.
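The read-then-release flow described above can be sketched from userspace as follows. This is a minimal illustration, not part of the xe driver: the helper name save_and_release_coredump is hypothetical, and the sketch assumes that a single-byte write is enough to release the dump, based on the statement that writing anything to the node releases it.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the coredump at data_path to out_path, then
 * release the coredump by writing one byte back to data_path.
 * Returns 0 on success, -1 on any failure.
 */
static int save_and_release_coredump(const char *data_path, const char *out_path)
{
	char buf[4096];
	ssize_t n;
	int ret = -1;

	int in = open(data_path, O_RDONLY);
	if (in < 0)
		return -1;

	int out = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}

	/* Copy the snapshot first: once released, it is gone. */
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		while (off < n) {
			ssize_t w = write(out, buf + off, (size_t)(n - off));

			if (w < 0)
				goto done;
			off += w;
		}
	}
	if (n == 0)
		ret = 0;	/* reached end of file without a read error */
done:
	close(in);
	if (close(out) != 0)
		ret = -1;
	if (ret != 0)
		return ret;

	/* Assumption: any write to the node releases the coredump. */
	int rel = open(data_path, O_WRONLY);
	if (rel < 0)
		return -1;
	ret = (write(rel, "1", 1) == 1) ? 0 : -1;
	close(rel);
	return ret;
}
```

A caller would pass a path such as /sys/class/drm/card0/device/devcoredump/data as data_path. Saving the copy before the release matters because only the first failure is captured, and the snapshot is freed once released.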
Internal API
void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const char *fmt, ...)
Take the required snapshots and initialize coredump device.
Parameters
struct xe_exec_queue *q
The faulty xe_exec_queue, where the issue was detected.
struct xe_sched_job *job
The faulty xe_sched_job, where the issue was detected.
const char *fmt
Printf format + args to describe the reason for the core dump
...
variable arguments
Description
This function should be called at the crash time within the serialized gt_reset. It is skipped if we still have the core dump device available with the information of the ‘first’ snapshot.
void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, const void *blob, size_t offset, size_t size)
print a BLOB to some useful location in ASCII85
Parameters
struct drm_printer *p
the printer object to output to
const char *prefix
optional prefix to add to output string
const void *blob
the Binary Large OBject to dump out
size_t offset
offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)
size_t size
the size in bytes of the BLOB, must be a multiple of sizeof(u32)
Description
The output is split across multiple lines because some print targets, e.g. dmesg,
cannot handle arbitrarily long lines. Note also that printing to dmesg in
piecemeal fashion is not possible: each separate call to drm_puts() has a
line-feed automatically added. Therefore, the entire output line must be
constructed in a local buffer first, then printed in one atomic output call.
There is also a scheduler yield call to prevent the ‘task has been stuck for 120s’ kernel hang check feature from firing when printing to a slow target such as dmesg over a serial port.
TODO: Add compression prior to the ASCII85 encoding to shrink huge buffers down.
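The line-buffered ASCII85 output described above can be illustrated with a small userspace sketch. It is not the driver's implementation: the names a85_encode_word and a85_format_lines and the max_line parameter are hypothetical. The encoding assumed here is the common ASCII85 variant in which each 32-bit word becomes five characters starting at '!', with an all-zero word abbreviated to 'z'. The prefix handling and the scheduler yield are not modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode one 32-bit word as ASCII85.
 * Returns the number of characters written to out: 1 for a zero word
 * (abbreviated to 'z'), otherwise 5.
 */
static size_t a85_encode_word(uint32_t in, char out[5])
{
	if (in == 0) {
		out[0] = 'z';
		return 1;
	}
	for (int i = 4; i >= 0; i--) {
		out[i] = (char)('!' + in % 85);
		in /= 85;
	}
	return 5;
}

/*
 * Encode nwords words into dst as newline-terminated lines of at most
 * max_line characters each (not counting the newline). A group is never
 * split across lines, so each line can be handed to a printer in one call.
 * Returns the number of bytes written (excluding the trailing NUL), or 0
 * if dst is too small or max_line is below 5.
 */
static size_t a85_format_lines(const uint32_t *words, size_t nwords,
			       size_t max_line, char *dst, size_t dst_cap)
{
	size_t len = 0, line_len = 0;

	if (max_line < 5 || dst_cap == 0)
		return 0;

	for (size_t i = 0; i < nwords; i++) {
		char group[5];
		size_t n = a85_encode_word(words[i], group);

		/* Close the current line if this group would overflow it. */
		if (line_len + n > max_line) {
			if (len + 1 >= dst_cap)
				return 0;
			dst[len++] = '\n';
			line_len = 0;
		}
		if (len + n >= dst_cap)
			return 0;
		memcpy(dst + len, group, n);
		len += n;
		line_len += n;
	}
	/* Terminate the final, possibly partial, line. */
	if (line_len > 0) {
		if (len + 1 >= dst_cap)
			return 0;
		dst[len++] = '\n';
	}
	dst[len] = '\0';
	return len;
}
```

For example, encoding the three words 0xFFFFFFFF, 0 and 1 with max_line set to 6 yields two lines: `s8W-!z` followed by `!!!!"`. Working on whole 32-bit words is also why the offset and size parameters above must be multiples of sizeof(u32).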