The scenario we simulate
Once per month we trigger a “triple choke” rehearsal:
- Primary CEX API starts returning stale quotes for exactly 90 seconds.
- Our Layer-2 relayer queue grows past 150 pending writes.
- Discord/Slack webhooks hang, forcing traders to rely purely on PagerDuty alerts.
If we can still flatten and re-open core perp positions inside four minutes, the drill passes.
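The pass criterion above can be sketched as a small check. This is a hypothetical helper (the names `drill_passed`, `DRILL_BUDGET`, and the event format are illustrative, not our production code), assuming the drill log records timestamped `flatten` and `reopen` events relative to T-0:

```python
from datetime import datetime, timedelta

# The drill passes only if all core perp positions are both flattened
# and re-opened within four minutes of T-0.
DRILL_BUDGET = timedelta(minutes=4)

def drill_passed(t0: datetime, events: list[tuple[str, datetime]]) -> bool:
    """events: (action, timestamp) pairs, e.g. ("flatten", ts) or ("reopen", ts)."""
    flatten = [ts for action, ts in events if action == "flatten"]
    reopen = [ts for action, ts in events if action == "reopen"]
    if not flatten or not reopen:
        return False  # drill incomplete: something never flattened or never re-opened
    # The slowest flatten and the slowest reopen must both land inside the budget.
    return max(max(flatten), max(reopen)) - t0 <= DRILL_BUDGET
```

Keeping the criterion as one pure function makes it trivial to replay past drill logs against a tightened budget.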
Runbook snapshot
- T-minus 0s — automation tags every live strategy with `latency-drill`. Positions with less than 8 bps of edge are flattened immediately.
- +30s — fail over pricing to our backup colo in Zürich; this includes refreshing auth tokens and wiping all local caches.
- +60s — reroute settlement to the “slow path” bridge that uses batched proofs; yes it is expensive, but it is deterministic.
- +90s — trading leads acknowledge the drill inside PagerDuty; if acknowledgement is missing, ops has authority to liquidate non-critical strategies.
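The T-0 step of the runbook can be sketched as follows. This is a simplified stand-in (the `Strategy` class and `start_drill` function are hypothetical; the real flatten call goes through our execution layer), showing only the tagging and the 8 bps edge floor:

```python
from dataclasses import dataclass, field

EDGE_FLOOR_BPS = 8.0  # positions with less edge than this are flattened at T-0

@dataclass
class Strategy:
    name: str
    edge_bps: float
    tags: set = field(default_factory=set)
    flat: bool = False

def start_drill(strategies: list[Strategy]) -> list[str]:
    """T-0: tag every live strategy, flatten anything below the edge floor."""
    for s in strategies:
        s.tags.add("latency-drill")
        if s.edge_bps < EDGE_FLOOR_BPS:
            s.flat = True  # stand-in for the real flatten order
    return [s.name for s in strategies if s.flat]
```

Tagging everything, not just the flattened strategies, is deliberate: it lets later steps (and post-mortem queries) distinguish drill activity from organic trading.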
Metrics we watch
| Metric | Target | Alert |
|---|---|---|
| Synthetic spread error | < 4 bps | 6 bps |
| Relayer queue depth | < 80 | 120 |
| Bridge confirmation time | < 140s | 180s |
| Human acknowledgement | < 2 min | 3 min |
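The table encodes a two-level scheme: below target is healthy, between target and the alert threshold is a warning, and at or above the alert threshold pages someone. A minimal grader, assuming the thresholds above (times converted to seconds; the metric keys are illustrative names, not our actual Grafana series):

```python
# (target, alert) pairs from the metrics table, in the units noted in the key.
THRESHOLDS = {
    "synthetic_spread_error_bps": (4, 6),
    "relayer_queue_depth": (80, 120),
    "bridge_confirmation_s": (140, 180),
    "human_ack_s": (120, 180),  # 2 min target, 3 min alert
}

def grade(metric: str, value: float) -> str:
    target, alert = THRESHOLDS[metric]
    if value < target:
        return "ok"
    if value < alert:
        return "warn"  # above target but not yet paging
    return "alert"
```

Keeping both thresholds in one table makes it easy to see at a glance how much headroom sits between "we should look at this" and "wake someone up".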
Lessons learned so far
- Our biggest enemy is muscle memory. Having the runbook printed next to each desk reduced “what do I do now?” delays by half.
- We now keep a tiny amount of stablecoins in a CeFi wallet with pre-approved withdrawal addresses so we can source liquidity even if on-chain rails clog.
- Observability matters more than throughput. The custom Grafana board we built for the drill has saved us twice in production already.
If you want a copy of the checklist we run through, grab it in /docs/drills/latency-failover.md inside the SwipeX vault.
