Available Metrics

The metrics exporter collects job-level and node-level performance data for active HPC jobs and exports it in the Prometheus text exposition format. Which metrics are available depends on the profilers enabled at startup via command-line flags. All metric names are prefixed with keystone_ in the Prometheus output.
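To make the naming rule concrete, here is a minimal sketch of how a prefixed sample line could be rendered in the Prometheus text format. The helper format_sample and the label names job_id and step_id are assumptions made for illustration; the exporter's real labels may differ, and label-value escaping is omitted.

```python
from __future__ import annotations

PREFIX = "keystone_"  # prefix stated in the text above


def format_sample(name: str, value: float, labels: dict[str, str] | None = None) -> str:
    """Render one sample line in the Prometheus text exposition format.

    Simplified: label values are not escaped.
    """
    label_text = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_text = "{" + pairs + "}"
    return f"{PREFIX}{name}{label_text} {value}"


# Hypothetical label names; the exporter's actual label set is not documented here.
print(format_sample("cpu_percent", 187.5, {"job_id": "12345", "step_id": "0"}))
print(format_sample("node_memory_total_bytes", 68719476736))
```

With these inputs, the first call yields a line of the form keystone_cpu_percent{job_id="12345",step_id="0"} 187.5, so any query or dashboard should reference the prefixed name rather than the bare name listed in the tables.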

System Metrics

Job-Level Metrics (--sys-job)

The system job profiler collects metrics aggregated per job step. Per-process values are summed across all processes belonging to a given job and step.

| Metric Name | Type | Description |
|---|---|---|
| cpu_percent | Gauge | Total CPU utilization across all job processes. |
| memory_percent | Gauge | Total resident memory usage as a percentage of total RAM. |
| num_threads | Gauge | Total number of threads across all job processes. |
| open_files | Gauge | Total number of open file descriptors across all job processes. |
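The per-step summation described above can be sketched as follows. The sample tuples and the aggregate helper are hypothetical and stand in for whatever per-process data the profiler actually gathers.

```python
from collections import defaultdict

# Hypothetical per-process samples: (job_id, step_id, cpu_percent, num_threads).
samples = [
    ("1001", "0", 95.0, 4),
    ("1001", "0", 80.0, 2),
    ("1001", "1", 10.0, 1),
]


def aggregate(process_samples):
    """Sum per-process values into one record per (job, step) pair."""
    totals = defaultdict(lambda: {"cpu_percent": 0.0, "num_threads": 0})
    for job_id, step_id, cpu, threads in process_samples:
        entry = totals[(job_id, step_id)]
        entry["cpu_percent"] += cpu
        entry["num_threads"] += threads
    return dict(totals)


for (job_id, step_id), values in aggregate(samples).items():
    print(job_id, step_id, values)
```

Because the values are sums, a job-level cpu_percent can exceed 100 if each process's value is measured relative to a single core. That interpretation is an assumption here, so check it against real output before setting alert thresholds.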

Node-Level Metrics (--sys-node)

The system node profiler collects host-wide metrics for the entire compute node.

| Metric Name | Type | Description |
|---|---|---|
| node_cpu_percent | Gauge | Overall CPU utilization percentage. |
| node_memory_total_bytes | Gauge | Total physical memory in bytes. |
| node_memory_used_bytes | Gauge | Used physical memory in bytes. |
| node_memory_percent | Gauge | Physical memory usage as a percentage. |
| node_swap_total_bytes | Gauge | Total swap space in bytes. |
| node_swap_used_bytes | Gauge | Used swap space in bytes. |
| node_swap_percent | Gauge | Swap usage as a percentage. |
| node_disk_read_bytes | Gauge | Cumulative bytes read from disk. |
| node_disk_write_bytes | Gauge | Cumulative bytes written to disk. |
| node_net_bytes_sent | Gauge | Cumulative bytes sent over the network. |
| node_net_bytes_recv | Gauge | Cumulative bytes received over the network. |
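The disk and network metrics are cumulative totals, so throughput has to be derived from two successive readings. The sketch below shows one way to do that. The function name and the reset handling are illustrative assumptions, not documented exporter behavior.

```python
def bytes_per_second(prev_value, prev_time, curr_value, curr_time):
    """Derive an average rate from two readings of a cumulative byte total.

    Times are in seconds. If the total went backwards (for example after a
    reboot), the current value is treated as the amount accumulated since
    the reset, which is a simplifying assumption.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value
    return delta / elapsed


# Two hypothetical readings of node_disk_read_bytes taken 15 seconds apart.
print(bytes_per_second(1000, 0.0, 4000, 15.0))
```

The same calculation applies to node_disk_write_bytes, node_net_bytes_sent, and node_net_bytes_recv.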

Nvidia GPU Metrics

Job-Level Metrics (--nvidia-job)

The Nvidia job profiler collects GPU metrics aggregated per job step. Values are summed across all devices and all GPU processes belonging to a given job and step.

| Metric Name | Type | Description |
|---|---|---|
| gpu_memory_bytes | Gauge | Total GPU memory allocated by the job in bytes. |
| gpu_memory_percent | Gauge | GPU memory usage as a percentage of device total. |
| gpu_sm_utilization | Gauge | Streaming multiprocessor utilization percentage. |
| gpu_memory_utilization | Gauge | Memory controller utilization percentage. |
| gpu_encoder_utilization | Gauge | Video encoder utilization percentage. |
| gpu_decoder_utilization | Gauge | Video decoder utilization percentage. |

Node-Level Metrics (--nvidia-node)

The Nvidia node profiler collects device-level metrics for all GPUs on the node.

| Metric Name | Type | Description |
|---|---|---|
| node_gpu_utilization | Gauge | Overall GPU core utilization percentage. |
| node_gpu_memory_utilization | Gauge | Overall memory controller utilization percentage. |
| node_gpu_temperature_celsius | Gauge | GPU temperature in degrees Celsius. |
| node_gpu_fan_speed_percent | Gauge | Fan speed as a percentage of maximum. |
| node_gpu_power_usage_watts | Gauge | Current power draw in watts. |
| node_gpu_memory_total_bytes | Gauge | Total GPU memory in bytes. |
| node_gpu_memory_used_bytes | Gauge | Used GPU memory in bytes. |
| node_gpu_memory_free_bytes | Gauge | Free GPU memory in bytes. |
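On the consuming side, the byte-valued GPU metrics can be combined into a per-device usage percentage. The sketch below parses a small sample of exposition text and computes used memory as a percentage of the total. The gpu label, the sample values, and both helper functions are assumptions for illustration, and the parser is deliberately simplified (no escaped quotes, no timestamps).

```python
import re

# Simplified patterns for "name{labels} value" lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical exporter output; the "gpu" label name is an assumption.
EXAMPLE = """
# TYPE keystone_node_gpu_memory_total_bytes gauge
keystone_node_gpu_memory_total_bytes{gpu="0"} 1000
keystone_node_gpu_memory_used_bytes{gpu="0"} 250
"""


def parse_samples(text):
    """Map (metric name, sorted label pairs) to a float value."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = tuple(sorted(LABEL_RE.findall(match.group("labels") or "")))
        result[(match.group("name"), labels)] = float(match.group("value"))
    return result


def gpu_memory_used_percent(samples):
    """Compute used/total * 100 for every label set that has both metrics."""
    percents = {}
    for (name, labels), total in samples.items():
        if name != "keystone_node_gpu_memory_total_bytes" or total <= 0:
            continue
        used = samples.get(("keystone_node_gpu_memory_used_bytes", labels))
        if used is not None:
            percents[labels] = 100.0 * used / total
    return percents


print(gpu_memory_used_percent(parse_samples(EXAMPLE)))
```

In practice a Prometheus query would usually perform this division. The sketch only shows how the total and used metrics relate to each other.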