Provide "zello_sysman" tool with binary releases #787

Open
@eero-t

Description

Users want some tool to monitor their GPUs.

Currently there are no good options:

  • XPUM officially supports (is tested with) only data center GPUs, and as a result:
    • Its binary releases rely on Intel repo packages, which can conflict with distro packages
    • Its latest container release is too old to support Xe / latest GPUs
  • Collect v6-RC includes a Sysman plugin, but has neither a final release nor binary releases
    • And its development has completely stalled
  • Neither of them is included in any distro

To fix that, I propose that the zello_sysman binary be installed when compute-runtime is built, and that it be included in the compute-runtime release packages. That way it should eventually become available in the distros as well.

While its output is not as nicely laid out as that of xpu-smi, it does provide all the metrics available from the L0 backend.

There are a few things that could be done to productize it better for end users:

  • Add a manual page (I could help with that)
  • Change the help output a bit to indicate that it outputs metrics
    • e.g. "selectively run fan black box test" -> "run fan tests and provide resulting metrics"
  • Maybe rename it to ze_sysman_tool or something similar

(Its source code is not that large, so another option would be to include it in the doc/ dir as an L0 usage example.)
