The scenario we simulate
Once per month we trigger a “triple choke” rehearsal:
- Primary CEX API starts returning stale quotes for exactly 90 seconds.
- Our Layer-2 relayer queue grows past 150 pending writes.
- Discord/Slack webhooks hang, forcing traders to rely purely on PagerDuty alerts.
If we can still flatten and re-open core perp positions inside four minutes, the drill passes.
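The pass criterion above can be sketched as a small check. This is a hypothetical helper (the names `drill_passed`, `DRILL_BUDGET`, and the event format are illustrative, not our production code), assuming the drill log records timestamped `flatten` and `reopen` events relative to T-0:

```python
from datetime import datetime, timedelta

# The drill passes only if all core perp positions are both flattened
# and re-opened within four minutes of T-0.
DRILL_BUDGET = timedelta(minutes=4)

def drill_passed(t0: datetime, events: list[tuple[str, datetime]]) -> bool:
    """events: (action, timestamp) pairs, e.g. ("flatten", ts) or ("reopen", ts)."""
    flatten = [ts for action, ts in events if action == "flatten"]
    reopen = [ts for action, ts in events if action == "reopen"]
    if not flatten or not reopen:
        return False  # drill incomplete: something never flattened or never re-opened
    # The slowest flatten and the slowest reopen must both land inside the budget.
    return max(max(flatten), max(reopen)) - t0 <= DRILL_BUDGET
```

Keeping the criterion as one pure function makes it trivial to replay past drill logs against a tightened budget.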
Runbook snapshot
- T-minus 0s — automation tags every live strategy with `latency-drill`. Positions with less than 8 bps of edge are flattened immediately.
- +30s — fail over pricing to our backup colo in Zürich; this includes refreshing auth tokens and wiping all local caches.
- +60s — reroute settlement to the “slow path” bridge that uses batched proofs; yes it is expensive, but it is deterministic.
- +90s — trading leads acknowledge the drill inside PagerDuty; if acknowledgement is missing, ops has authority to liquidate non-critical strategies.
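The T-0 step of the runbook can be sketched as follows. This is a simplified stand-in (the `Strategy` class and `start_drill` function are hypothetical; the real flatten call goes through our execution layer), showing only the tagging and the 8 bps edge floor:

```python
from dataclasses import dataclass, field

EDGE_FLOOR_BPS = 8.0  # positions with less edge than this are flattened at T-0

@dataclass
class Strategy:
    name: str
    edge_bps: float
    tags: set = field(default_factory=set)
    flat: bool = False

def start_drill(strategies: list[Strategy]) -> list[str]:
    """T-0: tag every live strategy, flatten anything below the edge floor."""
    for s in strategies:
        s.tags.add("latency-drill")
        if s.edge_bps < EDGE_FLOOR_BPS:
            s.flat = True  # stand-in for the real flatten order
    return [s.name for s in strategies if s.flat]
```

Tagging everything, not just the flattened strategies, is deliberate: it lets later steps (and post-mortem queries) distinguish drill activity from organic trading.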
Metrics we watch
| Metric | Target | Alert |
|---|---|---|
| Synthetic spread error | < 4 bps | 6 bps |
| Relayer queue depth | < 80 | 120 |
| Bridge confirmation time | < 140s | 180s |
| Human acknowledgement | < 2 min | 3 min |
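The table encodes a two-level scheme: below target is healthy, between target and the alert threshold is a warning, and at or above the alert threshold pages someone. A minimal grader, assuming the thresholds above (times converted to seconds; the metric keys are illustrative names, not our actual Grafana series):

```python
# (target, alert) pairs from the metrics table, in the units noted in the key.
THRESHOLDS = {
    "synthetic_spread_error_bps": (4, 6),
    "relayer_queue_depth": (80, 120),
    "bridge_confirmation_s": (140, 180),
    "human_ack_s": (120, 180),  # 2 min target, 3 min alert
}

def grade(metric: str, value: float) -> str:
    target, alert = THRESHOLDS[metric]
    if value < target:
        return "ok"
    if value < alert:
        return "warn"  # above target but not yet paging
    return "alert"
```

Keeping both thresholds in one table makes it easy to see at a glance how much headroom sits between "we should look at this" and "wake someone up".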
Lessons learned so far
- Our biggest enemy is muscle memory. Having the runbook printed next to each desk reduced “what do I do now?” delays by half.
- We now keep a tiny amount of stablecoins in a CeFi wallet with pre-approved withdrawal addresses so we can source liquidity even if on-chain rails clog.
- Observability matters more than throughput. The custom Grafana board we built for the drill has saved us twice in production already.
If you want a copy of the checklist we run through, grab it in /docs/drills/latency-failover.md inside the SwipeX vault.
