Available Metrics
The metrics exporter collects job-level and node-level performance data for active HPC jobs and exports it in the
Prometheus text exposition format. Which metrics are available depends on the profiles enabled at startup through
command-line flags. All metric names carry the keystone_ prefix in Prometheus output.
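As a rough illustration of the prefix convention, the sketch below filters exporter metrics out of a scrape body and strips the keystone_ prefix so the names match the tables on this page. The sample text, the `job` and `step` label names, and the function name `parse_keystone_metrics` are assumptions made for the example and are not taken from the exporter. The parser also assumes sample lines carry no trailing timestamp.

```python
# Hypothetical scrape body; the label names and values are illustrative only.
SAMPLE = """\
# TYPE keystone_cpu_percent gauge
keystone_cpu_percent{job="1234",step="0"} 187.5
keystone_node_memory_percent 41.2
process_open_fds 12
"""


def parse_keystone_metrics(text, prefix="keystone_"):
    """Map each un-prefixed metric name to a list of (labels, value) pairs."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name, brace, labels = series.partition("{")
        if not name.startswith(prefix):
            continue  # not an exporter metric
        labels = labels.rstrip("}") if brace else ""
        metrics.setdefault(name[len(prefix):], []).append((labels, float(value)))
    return metrics


parsed = parse_keystone_metrics(SAMPLE)
```

Here `parsed` contains entries for `cpu_percent` and `node_memory_percent` only; the unrelated `process_open_fds` line is dropped because it lacks the prefix.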
System Metrics
Job-Level Metrics (--sys-job)
The system job profiler collects metrics aggregated per job step. Per-process values are summed across all processes belonging to a given job and step.
| Metric Name | Type | Description |
|---|---|---|
| cpu_percent | Gauge | Total CPU utilization across all job processes. |
| memory_percent | Gauge | Total resident memory usage as a percentage of total RAM. |
| num_threads | Gauge | Total number of threads across all job processes. |
| open_files | Gauge | Total number of open file descriptors across all job processes. |
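The per-step summation described above can be sketched as follows. This is a minimal illustration and not the exporter's actual implementation: the input record layout, the `job_id` and `step_id` keys, and the function name `aggregate_by_job_step` are all hypothetical.

```python
from collections import defaultdict

# Metric fields that are summed across the processes of one job step.
FIELDS = ("cpu_percent", "memory_percent", "num_threads", "open_files")


def aggregate_by_job_step(process_samples):
    """Sum per-process values into one record per (job_id, step_id)."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for sample in process_samples:
        key = (sample["job_id"], sample["step_id"])  # hypothetical keys
        for field in FIELDS:
            totals[key][field] += sample[field]
    return dict(totals)


# Two processes in step 0 of job 1234, one process in step 1.
samples = [
    {"job_id": 1234, "step_id": 0, "cpu_percent": 50.0,
     "memory_percent": 1.5, "num_threads": 4, "open_files": 10},
    {"job_id": 1234, "step_id": 0, "cpu_percent": 75.5,
     "memory_percent": 2.25, "num_threads": 8, "open_files": 6},
    {"job_id": 1234, "step_id": 1, "cpu_percent": 10.0,
     "memory_percent": 0.5, "num_threads": 1, "open_files": 3},
]
per_step = aggregate_by_job_step(samples)
```

Because the values are sums, a job-level figure such as cpu_percent reflects the whole step and is not an average over its processes.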
Node-Level Metrics (--sys-node)
The system node profiler collects host-wide metrics for the entire compute node.
| Metric Name | Type | Description |
|---|---|---|
| node_cpu_percent | Gauge | Overall CPU utilization percentage. |
| node_memory_total_bytes | Gauge | Total physical memory in bytes. |
| node_memory_used_bytes | Gauge | Used physical memory in bytes. |
| node_memory_percent | Gauge | Physical memory usage as a percentage. |
| node_swap_total_bytes | Gauge | Total swap space in bytes. |
| node_swap_used_bytes | Gauge | Used swap space in bytes. |
| node_swap_percent | Gauge | Swap usage as a percentage. |
| node_disk_read_bytes | Gauge | Cumulative bytes read from disk. |
| node_disk_write_bytes | Gauge | Cumulative bytes written to disk. |
| node_net_bytes_sent | Gauge | Cumulative bytes sent over the network. |
| node_net_bytes_recv | Gauge | Cumulative bytes received over the network. |
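The disk and network metrics are cumulative totals, so a throughput figure has to be derived from two successive samples. The sketch below shows one way to do that; the function name `bytes_per_second` and the treatment of a decreasing value as a reset are assumptions for illustration and not documented exporter behavior.

```python
def bytes_per_second(prev_value, curr_value, interval_seconds):
    """Estimate throughput from two samples of a cumulative byte total.

    Returns None when the total went down, which is treated here as a
    reset (an assumption for this sketch).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        return None
    return delta / interval_seconds


# Two hypothetical node_net_bytes_sent samples taken 15 seconds apart.
rate = bytes_per_second(1_000, 4_000, 15)
```

With these inputs, `rate` is 200.0 bytes per second.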
Nvidia GPU Metrics
Job-Level Metrics (--nvidia-job)
The Nvidia job profiler collects GPU metrics aggregated per job step. Values are summed across all GPU processes belonging to a given job and step, across all devices.
| Metric Name | Type | Description |
|---|---|---|
| gpu_memory_bytes | Gauge | Total GPU memory allocated by the job in bytes. |
| gpu_memory_percent | Gauge | GPU memory usage as a percentage of device total. |
| gpu_sm_utilization | Gauge | Streaming multiprocessor utilization percentage. |
| gpu_memory_utilization | Gauge | Memory controller utilization percentage. |
| gpu_encoder_utilization | Gauge | Video encoder utilization percentage. |
| gpu_decoder_utilization | Gauge | Video decoder utilization percentage. |
Node-Level Metrics (--nvidia-node)
The Nvidia node profiler collects device-level metrics for all GPUs on the node.
| Metric Name | Type | Description |
|---|---|---|
| node_gpu_utilization | Gauge | Overall GPU core utilization percentage. |
| node_gpu_memory_utilization | Gauge | Overall memory controller utilization percentage. |
| node_gpu_temperature_celsius | Gauge | GPU temperature in degrees Celsius. |
| node_gpu_fan_speed_percent | Gauge | Fan speed as a percentage of maximum. |
| node_gpu_power_usage_watts | Gauge | Current power draw in watts. |
| node_gpu_memory_total_bytes | Gauge | Total GPU memory in bytes. |
| node_gpu_memory_used_bytes | Gauge | Used GPU memory in bytes. |
| node_gpu_memory_free_bytes | Gauge | Free GPU memory in bytes. |