Metrics & observability overview

rcfg-sim is built to be observed under load. Each server instance exposes Prometheus metrics and a health endpoint over HTTP.

--metrics-addr (default 0.0.0.0:9100) serves:

  • GET /metrics — Prometheus exposition format
  • GET /healthz — liveness check

Set --metrics-addr "" to disable the HTTP server.

curl -s http://127.0.0.1:9100/metrics | grep -E '^rcfgsim_'
curl -s http://127.0.0.1:9100/healthz

Every metric label takes its values from a closed set that is pre-registered at startup and never derived from raw user input. A client typing a thousand distinct garbage commands does not create a thousand label values; they all roll up under CmdUnknown. This keeps Prometheus healthy even when the simulator is abused, and a cardinality test in the project asserts it. Don’t add new labels casually: labels are part of the public API.
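As an illustration only, the closed-set idea can be sketched in a few lines of Python. This is not the simulator's own code: the command strings and the labels `CmdShowVersion` and `CmdShowRunningConfig` are hypothetical, and only `CmdUnknown` comes from the text above.

```python
# Hypothetical closed label set; only "CmdUnknown" is named in the docs.
KNOWN_COMMANDS = {
    "show version": "CmdShowVersion",
    "show running-config": "CmdShowRunningConfig",
}
UNKNOWN_LABEL = "CmdUnknown"


def command_label(raw: str) -> str:
    """Map raw client input onto a fixed label value, never the input itself."""
    normalized = " ".join(raw.strip().lower().split())
    return KNOWN_COMMANDS.get(normalized, UNKNOWN_LABEL)


def new_counters() -> dict:
    """Pre-register every label value at zero so the set is fixed up front."""
    counters = {label: 0 for label in KNOWN_COMMANDS.values()}
    counters[UNKNOWN_LABEL] = 0
    return counters


def record(counters: dict, raw: str) -> None:
    """Count one command under its closed-set label."""
    counters[command_label(raw)] += 1
```

However many distinct strings a client sends, `record` can only ever touch the keys created by `new_counters`, so the number of series stays constant.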

Because series are pre-registered at zero, the full metric set appears on the very first scrape, so you can alert on the absence of traffic (a counter that stays at zero), not just on its presence.
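One way to sanity-check this after startup is to scan the first scrape for the series you expect. The sketch below is a minimal, stdlib-only parser, not a full implementation of the exposition format. The sample text is hypothetical (including which label values appear), and the list of expected names is an assumption drawn from the metrics named on this page.

```python
import re

# Assumed subset of series names; the histogram is left out because it
# is exposed under _bucket/_sum/_count suffixes.
EXPECTED = (
    "rcfgsim_active_sessions",
    "rcfgsim_sessions_total",
    "rcfgsim_bytes_sent_total",
    "rcfgsim_faults_injected_total",
)

_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")

# Hypothetical first scrape, taken before any client has connected.
FIRST_SCRAPE = """\
# TYPE rcfgsim_active_sessions gauge
rcfgsim_active_sessions 0
rcfgsim_sessions_total{result="ok"} 0
rcfgsim_sessions_total{result="auth_fail"} 0
rcfgsim_bytes_sent_total 0
rcfgsim_faults_injected_total{type="slow_response"} 0
"""


def series_names(exposition: str) -> set:
    """Collect metric names from sample lines, skipping comments and blanks."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_series(exposition: str, expected=EXPECTED) -> list:
    """Return the expected names that do not appear in the scrape."""
    present = series_names(exposition)
    return [name for name in expected if name not in present]
```

An empty result from `missing_series` on the first scrape means every expected series is already present at zero, so a zero-rate alert has something to evaluate.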

  • rcfgsim_active_sessions — how close you are to --max-concurrent-sessions.
  • rcfgsim_sessions_total{result} — throughput and error mix (ok vs auth_fail / disconnect / error).
  • rcfgsim_command_duration_seconds — per-command latency, including delays and slow_response faults.
  • rcfgsim_bytes_sent_total — aggregate throughput.
  • rcfgsim_faults_injected_total{type} — confirm faults fire at the expected rate.
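For a quick look without Grafana, the first two metrics in the list can be turned into a utilization figure and an error ratio straight from a scrape. This is a rough sketch with a simplified parser. The numbers in the sample and the session limit of 1000 are made up for illustration, and the `result` values are the ones listed above.

```python
import re

_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

# Hypothetical scrape; the values are invented for the example.
SCRAPE = """\
rcfgsim_active_sessions 250
rcfgsim_sessions_total{result="ok"} 900
rcfgsim_sessions_total{result="auth_fail"} 60
rcfgsim_sessions_total{result="disconnect"} 30
rcfgsim_sessions_total{result="error"} 10
"""


def parse_samples(exposition: str) -> list:
    """Parse sample lines into (name, labels, value) tuples."""
    samples = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(_LABEL.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


def session_error_ratio(samples: list) -> float:
    """Fraction of finished sessions whose result is anything but "ok"."""
    total = failed = 0.0
    for name, labels, value in samples:
        if name != "rcfgsim_sessions_total":
            continue
        total += value
        if labels.get("result") != "ok":
            failed += value
    return failed / total if total else 0.0


def session_utilization(samples: list, max_sessions: int) -> float:
    """Active sessions as a fraction of the configured session limit."""
    for name, _labels, value in samples:
        if name == "rcfgsim_active_sessions":
            return value / max_sessions
    return 0.0
```

Here `max_sessions` stands in for whatever value was passed to --max-concurrent-sessions. Counters are cumulative, so the ratio covers the whole process lifetime; for a rate over a time window, compare two scrapes or use the Prometheus queries instead.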

The full list, with labels and histogram buckets, is in the metrics reference. Ready-to-paste queries are in Grafana queries.

Alongside the simulator metrics, the standard Go runtime collectors (goroutines, memory, GC) and process collectors (CPU, open file descriptors) are registered — useful for watching the host’s resource envelope as you scale up.