Metrics & observability overview

rcfg-sim is built to be observed under load. Each server instance exposes Prometheus metrics and a health endpoint over HTTP.

The endpoints

The HTTP server listens on --metrics-addr (default 0.0.0.0:9100) and serves:

GET /metrics — Prometheus exposition format
GET /healthz — liveness check

Set --metrics-addr "" to disable the HTTP server.

curl -s http://127.0.0.1:9100/metrics | grep -E '^rcfgsim_'
curl -s http://127.0.0.1:9100/healthz

Bounded cardinality by design

Every metric label draws its values from a closed set that is pre-registered at startup and never derived from raw user input. A client typing a thousand distinct garbage commands does not create a thousand label values; they all roll up under CmdUnknown. This keeps Prometheus healthy even when the simulator is abused, and a cardinality test in the project asserts it. Labels are part of the public API, so don't add new ones casually.

Because series are pre-registered at zero, the full metric set appears on the very first scrape. A counter that has never incremented still reports 0 instead of being missing, so you can alert on the absence of traffic, not just its presence.
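One way to observe the closed-set behaviour from the outside is to count series per metric name before and after a noisy run; the two counts should match. The sketch below assumes only that simulator metrics share the rcfgsim_ prefix, as in the curl example earlier. series_count is a hypothetical helper name, not something rcfg-sim ships.

```shell
# Count exposed series per metric name from Prometheus text on stdin.
# Only names starting with "rcfgsim_" are counted; "# HELP" / "# TYPE"
# comment lines and other metric families are ignored.
# (series_count is an illustrative helper, not part of rcfg-sim.)
series_count() {
  awk '
    /^rcfgsim_/ {
      name = $0
      sub(/[{ ].*$/, "", name)   # drop labels and value, keep the metric name
      count[name]++
    }
    END { for (n in count) print n, count[n] }
  ' | sort
}

# Usage, assuming the default metrics address:
#   curl -s http://127.0.0.1:9100/metrics | series_count > before.txt
#   ... run the abusive client ...
#   curl -s http://127.0.0.1:9100/metrics | series_count | diff before.txt -
```

An empty diff means no new series appeared during the run. Histogram metrics show up under their _bucket, _sum and _count names, which is expected.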

What to watch under load

rcfgsim_active_sessions — how close you are to --max-concurrent-sessions.
rcfgsim_sessions_total{result} — throughput and error mix (ok vs auth_fail / disconnect / error).
rcfgsim_command_duration_seconds — per-command latency, including delays and slow_response faults.
rcfgsim_bytes_sent_total — aggregate throughput.
rcfgsim_faults_injected_total{type} — confirm faults fire at the expected rate.
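For a quick look without Prometheus, the first two checks in the list above can be approximated from a single scrape. This is a rough sketch under stated assumptions: rcfgsim_active_sessions is taken to be an unlabeled gauge, the sample value is taken to be the last field on each line (no timestamps), and both function names are hypothetical. The failure share is computed over the whole process lifetime, not as a rate.

```shell
# Print how much of the session limit is in use.
# $1 is the value you passed to --max-concurrent-sessions.
# Assumes rcfgsim_active_sessions is exposed without labels.
session_headroom() {
  awk -v max="$1" '
    $1 == "rcfgsim_active_sessions" {
      printf "active=%d max=%d used=%.1f%%\n", $2, max, 100 * $2 / max
    }
  '
}

# Print the share of sessions whose result label is not "ok",
# summed over every rcfgsim_sessions_total series in the scrape.
# Assumes the sample value is the last field (no timestamp column).
session_failure_share() {
  awk '
    /^rcfgsim_sessions_total[{]/ {
      total += $NF
      if ($0 ~ /result="ok"/) ok += $NF
    }
    END {
      if (total > 0)
        printf "total=%d failed=%.1f%%\n", total, 100 * (total - ok) / total
    }
  '
}

# Usage, assuming the default metrics address and a limit of 1000:
#   curl -s http://127.0.0.1:9100/metrics | session_headroom 1000
#   curl -s http://127.0.0.1:9100/metrics | session_failure_share
```

For anything beyond a spot check, a rate-based query in Prometheus is the better tool, since these counters only ever grow.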

The full list, with labels and histogram buckets, is in the metrics reference. Ready-to-paste queries are in Grafana queries.

Runtime and process metrics

Alongside the simulator metrics, the standard Go runtime collectors (goroutines, memory, GC) and process collectors (CPU, open file descriptors) are registered. They are useful for watching the simulator process's resource envelope as you scale up.
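File descriptors are often the first limit hit when scaling session counts, so a small check against the process collector's output can be handy. The sketch below assumes the metric names process_open_fds and process_max_fds, which are what the Prometheus Go client's process collector normally exposes on Linux; whether rcfg-sim exposes them on your platform is an assumption to verify. fd_usage is a hypothetical helper name.

```shell
# Print open file descriptors as a share of the process limit.
# Reads Prometheus text on stdin; prints nothing if process_max_fds
# is missing or zero (for example on platforms without these metrics).
fd_usage() {
  awk '
    $1 == "process_open_fds" { open = $2 }
    $1 == "process_max_fds"  { max = $2 }
    END {
      if (max > 0)
        printf "open_fds=%d max_fds=%d used=%.1f%%\n", open, max, 100 * open / max
    }
  '
}

# Usage, assuming the default metrics address:
#   curl -s http://127.0.0.1:9100/metrics | fd_usage
```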