Install and Setup
Each compute node in the cluster requires its own instance of the Keystone metrics exporter. Once running, the exporter monitors active workloads and exposes performance metrics via an HTTP endpoint compatible with Prometheus.
Installation
The keystone-exporter package is available in the Better HPC package repository and can be installed using pipx:
BHPC_REPO="https://dl.cloudsmith.io/public/better-hpc/keystone/python/simple/"
pipx install --extra-index-url="$BHPC_REPO" --include-deps keystone-exporter
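After installation, it can be useful to confirm that the executable is reachable from the shell. The following is a minimal sketch rather than an official verification step; the helper name `is_installed` is hypothetical, and the hint assumes pipx's application directory has not yet been added to `PATH`.

```shell
# Return success if the given command name can be found on PATH.
# (is_installed is a hypothetical helper, not part of keystone-exporter.)
is_installed() {
    command -v "$1" >/dev/null 2>&1
}

if is_installed keystone-exporter; then
    echo "keystone-exporter found at: $(command -v keystone-exporter)"
else
    # pipx installs applications into its own bin directory, which may not be on PATH yet.
    echo "keystone-exporter not found on PATH; try 'pipx ensurepath' and open a new shell"
fi
```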
Running the Agent
The exporter includes multiple profilers, each providing a different set of Prometheus metrics. Launching the exporter requires specifying which profilers to enable. Not all profilers are supported on every machine. For example, the NVIDIA profilers are not supported on machines without NVIDIA GPUs.
The following example starts the exporter with all available profilers. For a complete list of metrics exposed by each profiler, see the metrics documentation.
keystone-exporter --sys-node --sys-job --nvidia-node --nvidia-job
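On a cluster that mixes GPU and CPU-only nodes, the profiler flags may need to be chosen per machine. The sketch below shows one possible approach. It assumes that finding `nvidia-smi` on `PATH` is a reasonable signal that the NVIDIA profilers will work; that heuristic and the helper name `select_profilers` are illustrative and not part of the exporter.

```shell
# Print the profiler flags suited to this machine.
# Assumption: nvidia-smi being on PATH indicates a usable NVIDIA GPU.
select_profilers() {
    flags="--sys-node --sys-job"
    if command -v nvidia-smi >/dev/null 2>&1; then
        flags="$flags --nvidia-node --nvidia-job"
    fi
    echo "$flags"
}

# Show the command that would be run; drop the leading "echo" to launch the exporter.
echo keystone-exporter $(select_profilers)
```

The command substitution is left unquoted on purpose so that the flags are split into separate arguments.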
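By default, metrics are exposed on 127.0.0.1:9105. Because this is the loopback address, the endpoint is reachable only from the node itself.
The bind address and port can be changed using the --host and --port options. The following example listens on all network interfaces on port 9200: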
keystone-exporter --host 0.0.0.0 --port 9200 --sys-job
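To check that a running exporter is reachable, the endpoint can be queried with `curl`. This is a sketch under one notable assumption: the metrics are served under the `/metrics` path, which is the usual Prometheus convention but is not stated above. The helper names `metrics_url` and `check_exporter` are hypothetical.

```shell
# Build the scrape URL for a host and port, defaulting to 127.0.0.1:9105.
# Assumption: metrics are served under /metrics (Prometheus convention).
metrics_url() {
    echo "http://${1:-127.0.0.1}:${2:-9105}/metrics"
}

# Print the first few lines of the metrics output, or report a failure on stderr.
check_exporter() {
    if output=$(curl --silent --fail --max-time 5 "$(metrics_url "$@")"); then
        echo "$output" | head -n 5
    else
        echo "no response from $(metrics_url "$@")" >&2
        return 1
    fi
}
```

With the exporter started as in the example above, `check_exporter 127.0.0.1 9200` would query the customized port from the same node.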
Logging
The exporter writes its log output to syslog by default.
Console logging can also be enabled using the --debug flag:
keystone-exporter --debug --sys-job