Load-test scenarios

A staged approach to scaling up. Each rung validates something before you commit to the next; the repo’s TEST-SCENARIOS.md covers the full progression in depth.

100 (smoke) → 1,000 (single IP) → 5,000 (multi-IP) → 25,000 (half fleet) → 50,000 (full fleet)

Smoke (100). Prove the toolchain end to end: generate, serve, connect, scrape. This is the Quickstart. Goal: a clean session and moving metrics.

Single IP (1,000). One IP, 1,000 ports. Point your collector at the range and run a full backup cycle. Watch rcfgsim_active_sessions and the session error mix. Goal: no errors, latency within expectation.

Multi-IP (5,000). Move to systemd instances and IP aliases, for example 2 IPs × 2,500 ports. This validates that your tooling handles many addresses and that per-instance drain works.
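To point a collector at a multi-IP layout, it helps to enumerate every (IP, port) pair up front. The sketch below does that for the 2 IPs × 2,500 ports example. The starting address `10.0.0.10`, the base port `10000`, and the assumption that aliases are consecutive addresses are all illustrative, not values taken from the repo.

```python
from ipaddress import IPv4Address
from typing import Iterator


def endpoints(first_ip: str, ip_count: int,
              base_port: int, ports_per_ip: int) -> Iterator[tuple[str, int]]:
    """Yield (ip, port) for consecutive IPs, each with a contiguous port range."""
    start = IPv4Address(first_ip)
    for i in range(ip_count):
        ip = str(start + i)  # assumes aliases are consecutive addresses
        for port in range(base_port, base_port + ports_per_ip):
            yield ip, port


# Hypothetical layout: 2 IP aliases, 2,500 ports each, starting at port 10000.
targets = list(endpoints("10.0.0.10", 2, 10000, 2500))
print(len(targets))             # 5000
print(targets[0], targets[-1])  # ('10.0.0.10', 10000) ('10.0.0.11', 12499)
```

The same helper covers the later rungs by changing `ip_count` to 10 or 20.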

Half fleet (25,000). 10 IPs × 2,500 ports. Now you’re exercising scheduler fairness and burst handling: fire all backups in a tight window and watch for queueing, timeouts, and resource pressure on both sides. Introduce faults at a low rate here.
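If your collector has no built-in burst mode, a small driver can approximate one. The following is a minimal sketch, not part of rcfgsim: `run_burst` starts every task at once, caps the number in flight with a semaphore, and tallies outcomes. `fake_backup` is a hypothetical stand-in for a real collector session, and the result names `ok`, `timeout`, and `error` are chosen here for illustration.

```python
import asyncio
from collections import Counter


async def run_burst(targets, backup, max_in_flight, timeout_s):
    """Start every backup at once, cap concurrency, and tally outcomes."""
    gate = asyncio.Semaphore(max_in_flight)
    results = Counter()

    async def one(target):
        async with gate:  # tasks queue here once the cap is reached
            try:
                # The timeout covers the session only, not time spent queued.
                await asyncio.wait_for(backup(target), timeout_s)
                results["ok"] += 1
            except asyncio.TimeoutError:
                results["timeout"] += 1
            except Exception:
                results["error"] += 1

    await asyncio.gather(*(one(t) for t in targets))
    return results


async def fake_backup(target):
    # Hypothetical stand-in: replace with a real session against the target.
    await asyncio.sleep(0.001)


tally = asyncio.run(run_burst(range(200), fake_backup,
                              max_in_flight=50, timeout_s=1.0))
print(dict(tally))
```

Comparing the driver's tally with the simulator's own session counters is a quick way to see whether failures originate on the client side or the server side.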

Full fleet (50,000). 20 IPs × 2,500 ports on the reference host (12 vCPU / 48 GB). This is the production-shaped test: a realistic size distribution, faults at a realistic rate, and a full collection cycle. Goal: characterize throughput, tail latency, and the resource envelope of the system under test.

  • Throughput: rate(rcfgsim_sessions_total{result="ok"}[1m])
  • Error mix: non-ok results as a share of total
  • Tail latency: p95/p99 of rcfgsim_command_duration_seconds
  • Concurrency headroom: rcfgsim_active_sessions vs your cap
  • Host envelope: Go/process collectors (goroutines, FDs, CPU, memory)

See Grafana queries for the exact PromQL.
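As a rough illustration of how these signals are derived, the sketch below pairs a few PromQL expressions with the equivalent arithmetic. These are not the repo's queries: the `_bucket` suffix and the 5m window assume that rcfgsim_command_duration_seconds is a Prometheus histogram, and the result labels `timeout` and `auth_failed` in the example are hypothetical.

```python
# PromQL sketches for the signals above (assumptions, not the repo's queries).
QUERIES = {
    "throughput": 'sum(rate(rcfgsim_sessions_total{result="ok"}[1m]))',
    "error_share": (
        'sum(rate(rcfgsim_sessions_total{result!="ok"}[1m]))'
        ' / sum(rate(rcfgsim_sessions_total[1m]))'
    ),
    # Assumes a histogram, i.e. that a *_bucket series with an le label exists.
    "p99_latency": (
        'histogram_quantile(0.99, sum by (le) '
        '(rate(rcfgsim_command_duration_seconds_bucket[5m])))'
    ),
}


def error_share(counts: dict[str, float]) -> float:
    """Non-ok sessions as a fraction of all sessions."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts.get("ok", 0)) / total


def headroom(active: float, cap: float) -> float:
    """Fraction of the session cap that is still unused."""
    return 1.0 - active / cap


# Hypothetical numbers from one collection cycle.
print(f"{error_share({'ok': 4950, 'timeout': 40, 'auth_failed': 10}):.1%}")  # 1.0%
print(f"{headroom(active=1800, cap=2500):.1%}")                              # 28.0%
```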

  • Keep the same host key across restarts so clients don’t see host-key-changed warnings.
  • Use a fixed --seed so each run faces the identical fleet.
  • Restart instances with systemctl restart to get a graceful drain rather than killing them.
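The value of a fixed seed is easiest to see with a toy generator. The sketch below is not rcfgsim's generator. `make_fleet`, the hostname pattern, and the size range are hypothetical, and the point is only that a seeded private RNG yields the same fleet on every run.

```python
import random


def make_fleet(seed: int, size: int) -> list[tuple[str, int]]:
    """Hypothetical fleet generator: (hostname, config size in bytes)."""
    rng = random.Random(seed)  # private RNG, unaffected by global random state
    return [(f"dev-{i:05d}", rng.randint(2_000, 200_000)) for i in range(size)]


same = make_fleet(42, 1000) == make_fleet(42, 1000)
different = make_fleet(42, 1000) != make_fleet(43, 1000)
print(same, different)
```

With the fleet held constant, a change in throughput or tail latency between runs can be attributed to the system under test rather than to a different mix of devices.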