Xe Device Coredump

Xe uses the dev_coredump infrastructure to expose crash information in a standardized way. Once a crash occurs, devcoredump exposes a temporary node under /sys/class/devcoredump/devcd<m>/. The same node is also accessible at /sys/class/drm/card<n>/device/devcoredump/. The failing_device symlink points to the device that crashed and created the coredump.
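As a minimal userspace sketch of reading the node described above, the C function below copies a coredump file to another stream. The name copy_coredump is hypothetical and not part of any kernel or library API; only the sysfs layout comes from the text above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: copy the whole file at 'path' to 'out'.
 * Returns the number of bytes copied, or -1 on error (for example
 * when no coredump exists or the caller lacks permission).
 */
long copy_coredump(const char *path, FILE *out)
{
	char buf[4096];
	size_t n;
	long total = 0;
	FILE *in = fopen(path, "rb");

	if (!in)
		return -1;

	/* Read in chunks so large dumps need no big allocation. */
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		if (fwrite(buf, 1, n, out) != n) {
			fclose(in);
			return -1;
		}
		total += (long)n;
	}

	if (ferror(in)) {
		fclose(in);
		return -1;
	}

	fclose(in);
	return total;
}
```

For instance, copy_coredump("/sys/class/drm/card0/device/devcoredump/data", stdout) would print the dump for card0, assuming one is currently present.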

The following characteristics are observed by xe when creating a device coredump:

Snapshot at hang:

The ‘data’ file contains a snapshot of the HW and driver states at the time the hang happened. Because the driver recovers from resets and crashes, the snapshot may not correspond to the state of the system at the time userspace reads the file.

Coredump release:

After a coredump is generated, it stays in kernel memory until userspace releases it by writing anything to the ‘data’ file, or until an internal timer expires. The exact timeout may vary and should not be relied upon. Example to release a coredump:

$ > /sys/class/drm/card0/device/devcoredump/data
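The same release can be sketched from a C program. This is only an illustration under the assumption stated above that any write to the ‘data’ file releases the dump; the name release_coredump is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: release a coredump by writing a single byte
 * to its 'data' file. Returns 0 on success, -1 on failure.
 */
int release_coredump(const char *data_path)
{
	FILE *f = fopen(data_path, "w");

	if (!f)
		return -1;

	/* The content is irrelevant; one byte is enough to issue a write. */
	if (fputc('1', f) == EOF) {
		fclose(f);
		return -1;
	}

	/* fclose() flushes the buffer, so write errors surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller would pass the path from the shell example, for instance release_coredump("/sys/class/drm/card0/device/devcoredump/data").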

First failure only:

In general, the first hang is the most critical one, since subsequent hangs can be a consequence of the initial hang. For this reason a snapshot is taken only for the first failure. Until the devcoredump is released by userspace or by the kernel, subsequent hangs neither overwrite the snapshot nor create new ones. Devcoredump has a delayed work queue that will eventually delete the file node and free all the dump information.
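The first-failure-only policy can be modelled with a small state machine. The sketch below is a userspace toy, not the driver's implementation; struct dump_slot, dump_capture and dump_release are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of a single coredump slot (hypothetical, not driver code). */
struct dump_slot {
	bool captured;		/* true while a snapshot is held */
	char reason[64];	/* description of the first failure */
};

/*
 * Record a failure. Only the first call after a release stores
 * anything; later calls are ignored and return false.
 */
bool dump_capture(struct dump_slot *s, const char *reason)
{
	if (s->captured)
		return false;

	snprintf(s->reason, sizeof(s->reason), "%s", reason);
	s->captured = true;
	return true;
}

/* Models a release by userspace or by the timeout. */
void dump_release(struct dump_slot *s)
{
	s->captured = false;
	s->reason[0] = '\0';
}
```

In this model a second dump_capture() before dump_release() leaves the stored reason untouched, which mirrors the rule that subsequent hangs do not replace the first snapshot.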

Internal API

void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const char *fmt, ...)

Take the required snapshots and initialize the coredump device.

Parameters

struct xe_exec_queue *q

The faulty xe_exec_queue, where the issue was detected.

struct xe_sched_job *job

The faulty xe_sched_job, where the issue was detected.

const char *fmt

Printf format + args to describe the reason for the core dump

...

variable arguments

Description

This function should be called at crash time, within the serialized gt_reset. It is skipped if the coredump device holding the ‘first’ snapshot is still available.
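The fmt and variable arguments follow the usual printf convention. As a rough userspace illustration of how such a pair can be turned into a reason string, the sketch below uses vsnprintf(); format_reason is a hypothetical name and this is not how the driver itself is implemented.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a printf-style reason into 'buf'.
 * Returns what vsnprintf() returns: the length the full string
 * would have, which may exceed 'len' - 1 if it was truncated.
 */
int format_reason(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);

	return n;
}
```

A call such as format_reason(buf, sizeof(buf), "job timeout on queue %d", 3) fills buf with the formatted description; the message text here is made up for the example.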

void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, const void *blob, size_t offset, size_t size)

print a BLOB to some useful location in ASCII85

Parameters

struct drm_printer *p

the printer object to output to

const char *prefix

optional prefix to add to output string

const void *blob

the Binary Large OBject to dump out

size_t offset

offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)

size_t size

the size in bytes of the BLOB, must be a multiple of sizeof(u32)

Description

The output is split into multiple lines because some print targets, e.g. dmesg, cannot handle arbitrarily long lines. Note also that printing to dmesg in piecemeal fashion is not possible: each separate call to drm_puts() has a line feed added automatically. Therefore, the entire output line must be constructed in a local buffer first, then printed in one atomic output call.

There is also a scheduler yield call to prevent the ‘task has been stuck for 120s’ kernel hang check feature from firing when printing to a slow target such as dmesg over a serial port.
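The buffer-then-print approach described above can be sketched in userspace C. This is a simplified illustration under several assumptions: it takes 32-bit words rather than a byte offset and size, omits the prefix and the scheduler yield, uses the common ASCII85 convention of ‘z’ for an all-zero word, and picks an arbitrary line width. All names (print_blob_a85, line_sink, sink_emit, A85_LINE_CHARS) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define A85_LINE_CHARS 60	/* arbitrary width, not the driver's value */

/* Encode one word as base-85 digits; returns the characters written. */
static size_t a85_encode_word(uint32_t in, char *out)
{
	int i;

	if (in == 0) {
		out[0] = 'z';	/* shorthand for an all-zero word */
		return 1;
	}
	for (i = 4; i >= 0; i--) {
		out[i] = (char)('!' + in % 85);
		in /= 85;
	}
	return 5;
}

/* Called once per complete line; stands in for a single print call. */
typedef void (*emit_line_fn)(const char *line, void *ctx);

void print_blob_a85(const uint32_t *words, size_t count,
		    emit_line_fn emit, void *ctx)
{
	char line[A85_LINE_CHARS + 1];
	size_t used = 0, i;

	for (i = 0; i < count; i++) {
		/* Flush before a word that might not fit on this line. */
		if (used + 5 > A85_LINE_CHARS) {
			line[used] = '\0';
			emit(line, ctx);
			used = 0;
		}
		used += a85_encode_word(words[i], line + used);
	}
	if (used) {
		line[used] = '\0';
		emit(line, ctx);
	}
}

/* Example sink: appends each line plus '\n' to a fixed buffer. */
struct line_sink {
	char text[256];
	size_t len;
	unsigned int lines;
};

void sink_emit(const char *line, void *ctx)
{
	struct line_sink *s = ctx;
	size_t n = strlen(line);

	if (s->len + n + 2 <= sizeof(s->text)) {
		memcpy(s->text + s->len, line, n);
		s->len += n;
		s->text[s->len++] = '\n';
		s->text[s->len] = '\0';
	}
	s->lines++;
}
```

Because each line is assembled in the local buffer and handed to emit() exactly once, a target that appends a line feed per call still produces well-formed output.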

TODO: Add compression prior to the ASCII85 encoding to shrink huge buffers down.