
Feb 5, 2025 · 1 min read

Latency Failover Drill

How we rehearse kill-switches when CEX APIs and Layer-2 sequencers stall at the same time.

Nadia Korkmaz

The scenario we simulate

Once per month we trigger a “triple choke” rehearsal:

  1. Primary CEX API starts returning stale quotes for exactly 90 seconds.
  2. Our Layer-2 relayer queue grows past 150 pending writes.
  3. Discord/Slack webhooks hang, forcing traders to rely purely on PagerDuty alerts.

If we can still flatten and re-open core perp positions inside four minutes, the drill passes.
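A minimal sketch of how such a drill harness might be wired up, assuming hypothetical injector and status helpers (inject_stale_quotes, flood_relayer_queue, mute_webhooks, positions_flat_and_reopened) rather than our real tooling:

```python
# Hypothetical drill harness: trigger all three chokes at once, then poll
# until core perp positions are flat and re-opened or the budget burns out.
import time

DRILL_BUDGET_S = 240  # pass criterion: flatten and re-open inside four minutes

def run_triple_choke_drill(inject_stale_quotes, flood_relayer_queue,
                           mute_webhooks, positions_flat_and_reopened) -> bool:
    start = time.monotonic()
    inject_stale_quotes(duration_s=90)         # 1. primary CEX serves stale quotes
    flood_relayer_queue(pending_writes=150)    # 2. L2 relayer queue past 150 writes
    mute_webhooks(["discord", "slack"])        # 3. chat webhooks hang; PagerDuty only

    while time.monotonic() - start < DRILL_BUDGET_S:
        if positions_flat_and_reopened():
            return True                        # drill passes
        time.sleep(1)
    return False                               # budget exceeded: drill fails
```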

Runbook snapshot

  • T-minus 0s — automation tags every live strategy with a “latency-drill” label. Positions with less than 8 bps of edge are flattened immediately.
  • +30s — fail over pricing to our backup colo in Zürich; this includes refreshing auth tokens and wiping all local caches.
  • +60s — reroute settlement to the “slow path” bridge that uses batched proofs; yes, it is expensive, but it is deterministic.
  • +90s — trading leads acknowledge the drill inside PagerDuty; if acknowledgement is missing, ops has authority to liquidate non-critical strategies. The full timeline is sketched below.
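To make the timing explicit, here is a rough sketch of that timeline in code. Everything in it (Strategy, flatten, failover_pricing_to, reroute_settlement, pagerduty_acked, liquidate_non_critical, and the "zurich-backup-colo" / "slow-path-batched-proofs" labels) is an illustrative placeholder, not our actual stack:

```python
# Illustrative timeline runner for the runbook above; all callables and
# labels are placeholders standing in for internal automation.
import time
from dataclasses import dataclass, field

EDGE_FLOOR_BPS = 8.0   # positions thinner than this are flattened at T-0
ACK_DEADLINE_S = 90    # trading leads must acknowledge in PagerDuty by +90s

@dataclass
class Strategy:
    name: str
    edge_bps: float
    critical: bool = True
    tags: set = field(default_factory=set)

def run_runbook(strategies, flatten, failover_pricing_to,
                reroute_settlement, pagerduty_acked, liquidate_non_critical):
    start = time.monotonic()

    # T-0: tag every live strategy and flatten thin-edge positions
    for s in strategies:
        s.tags.add("latency-drill")
        if s.edge_bps < EDGE_FLOOR_BPS:
            flatten(s)

    # +30s: fail pricing over to the backup colo
    time.sleep(max(0, 30 - (time.monotonic() - start)))
    failover_pricing_to("zurich-backup-colo")

    # +60s: reroute settlement to the slow-path bridge
    time.sleep(max(0, 60 - (time.monotonic() - start)))
    reroute_settlement("slow-path-batched-proofs")

    # +90s: no human acknowledgement means ops liquidates non-critical books
    time.sleep(max(0, ACK_DEADLINE_S - (time.monotonic() - start)))
    if not pagerduty_acked():
        liquidate_non_critical([s for s in strategies if not s.critical])
```

Writing it this way anchors each step to elapsed time since the drill started, rather than to the previous step finishing, so a slow failover cannot silently push the later steps past their marks.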

Metrics we watch

Metric                      Target      Alert threshold
Synthetic spread error      < 4 bps     6 bps
Relayer queue depth         < 80        120
Bridge confirmation time    < 140 s     180 s
Human acknowledgement       < 2 min     3 min
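As a rough illustration, the thresholds in the table above could be encoded like this; the metric keys and the check_drill_metrics helper are made up for the example, only the numbers come from the table:

```python
# Target / alert thresholds from the table above (times in seconds).
DRILL_THRESHOLDS = {
    "synthetic_spread_error_bps": (4.0, 6.0),
    "relayer_queue_depth": (80, 120),
    "bridge_confirmation_s": (140, 180),
    "human_ack_s": (120, 180),
}

def check_drill_metrics(observed: dict) -> dict:
    """Classify each metric as 'ok', 'warn' (target breached) or 'alert'."""
    statuses = {}
    for metric, (target, alert) in DRILL_THRESHOLDS.items():
        value = observed[metric]
        if value >= alert:
            statuses[metric] = "alert"
        elif value >= target:
            statuses[metric] = "warn"
        else:
            statuses[metric] = "ok"
    return statuses

# Example: relayer queue above target but below the alert threshold
print(check_drill_metrics({
    "synthetic_spread_error_bps": 3.2,
    "relayer_queue_depth": 95,
    "bridge_confirmation_s": 118,
    "human_ack_s": 75,
}))
```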

Lessons learned so far

  • Our biggest enemy is the lack of muscle memory. Having the runbook printed next to each desk cut “what do I do now?” delays in half.
  • We now keep a tiny amount of stablecoins in a CeFi wallet with pre-approved withdrawal addresses so we can source liquidity even if on-chain rails clog.
  • Observability matters more than throughput. The custom Grafana board we built for the drill has saved us twice in production already.

If you want a copy of the checklist we run through, it lives at /docs/drills/latency-failover.md inside the SwipeX vault.