Beta

Maintenance Playbooks

Failure-response drills: timed, role-assigned rehearsals run from a script. When a P1 hits, the team has already practiced the handoff three times.

Start here

Run your first drill

Pick one scenario that maps to your last real incident, then run through the roles, timers, and script with the people closest to the response.

  • Pick a scenario that reflects your real outage or escalation risks.
  • Assign steward roles and timebox each rehearsal segment.
  • Capture gaps and update the playbook after the drill.

Playbook structure

Every playbook has four sections: a scenario trigger, a timed role script, a handoff protocol, and a post-drill log. No filler.

  • Scenario trigger: The exact conditions that start the clock. Example: “Primary on-call does not acknowledge a P1 alert within 3 minutes.”
  • Role script: What each named person does and in what order. Timed to the minute.
  • Handoff protocol: How command transfers when someone hits their limit or gets pulled into another incident.
  • Post-drill log: A one-page template for decisions, gaps, and next steps.
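The four sections above map naturally onto a small record type. The following is a minimal sketch, not the format the playbook pack actually ships in: the class names (`Playbook`, `RoleStep`), the field names, and the minute values in the example are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RoleStep:
    minute: int   # minutes after the trigger when the step is due (illustrative)
    role: str     # who acts, e.g. "comms lead"
    action: str   # what that person does


@dataclass
class Playbook:
    name: str
    scenario_trigger: str          # the exact conditions that start the clock
    role_script: list[RoleStep]    # who does what, timed to the minute
    handoff_protocol: list[str]    # ordered rules for transferring command
    # Post-drill log: empty template for decisions, gaps, and next steps.
    post_drill_log: dict[str, list[str]] = field(
        default_factory=lambda: {"decisions": [], "gaps": [], "next_steps": []}
    )

    def script_in_order(self) -> list[RoleStep]:
        """Return the role script sorted by due minute."""
        return sorted(self.role_script, key=lambda step: step.minute)


# Example values are placeholders, not a recommended script.
playbook = Playbook(
    name="Steward rotation",
    scenario_trigger="Primary on-call does not acknowledge a P1 alert within 3 minutes.",
    role_script=[
        RoleStep(minute=8, role="comms lead", action="Log the ownership change"),
        RoleStep(minute=0, role="engineering lead", action="Start the drill timer"),
    ],
    handoff_protocol=[
        "Secondary assumes command after 8 min without primary acknowledgement.",
    ],
)

for step in playbook.script_in_order():
    print(f"+{step.minute} min  {step.role}: {step.action}")
```

Keeping the post-drill log as part of the same record means each drill run starts with an empty template attached to the script it exercised.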

Sample: Steward rotation playbook excerpt

Scenario: Primary on-call is unreachable during a P1 incident.
Timer: 8 minutes to establish command.
Roles: Incident commander, comms lead, engineering lead.
Handoff protocol:

  1. Secondary assumes command after 8 min without primary acknowledgement.
  2. Comms lead logs the ownership change in the incident channel.
  3. Post-incident review captures the gap that caused the miss.
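Step 1 of the handoff protocol is a simple timing rule, and it can help to write it down precisely before a drill so nobody argues about the boundary. The sketch below is one possible reading, under stated assumptions: the function name `current_commander` and the `"pending"` state are hypothetical, an acknowledgement at exactly 8 minutes is treated as on time, and a late acknowledgement does not take command back from the secondary.

```python
from datetime import datetime, timedelta
from typing import Optional

# 8 minutes to establish command, per the sample playbook excerpt.
COMMAND_TIMEOUT = timedelta(minutes=8)


def current_commander(
    alert_time: datetime,
    primary_ack_time: Optional[datetime],
    now: datetime,
) -> str:
    """Return who holds command at `now`: 'primary', 'secondary', or 'pending'.

    Assumes primary_ack_time, when given, is not later than `now`.
    """
    # Primary acknowledged within the window: primary keeps command.
    if primary_ack_time is not None and primary_ack_time - alert_time <= COMMAND_TIMEOUT:
        return "primary"
    # Window elapsed without a timely acknowledgement: secondary assumes command.
    if now - alert_time >= COMMAND_TIMEOUT:
        return "secondary"
    # Still inside the window and no acknowledgement yet.
    return "pending"


alert = datetime(2024, 1, 1, 12, 0)
print(current_commander(alert, None, alert + timedelta(minutes=5)))
print(current_commander(alert, None, alert + timedelta(minutes=8)))
```

In a drill, the comms lead would record the moment the result changes to "secondary" as the ownership change in the incident channel.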

Playbooks currently available

  • Steward rotation: Practice handing off live services without losing context or missing recovery cues.
  • Pause and rollback: Drill the conditions that justify slowing or reversing a release while protecting affected people.
  • Care escalation: Run scenarios that center operator wellbeing and shared responsibility for impact.

Scenario library

  • Model drift week: Simulate escalating false positives, rising support volume, and stakeholder pressure to keep shipping.
  • Dependency outage: Rehearse what happens when a third-party model or API degrades without warning.
  • Policy change deadline: Practice adapting maintenance routines when legal or regulatory obligations shift mid-quarter.

What we measure in a drill

  • Coordination latency: Minutes between first alert and cross-team acknowledgement.
  • Escalation quality: Whether the incident notes include clear owner transitions and decision rationale.
  • Recovery burden: Overtime, manual workarounds, and repeat incidents after the drill.
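Of the three measures, coordination latency is the most mechanical to compute. As a rough sketch, assuming the drill records one timestamp for the first alert and one for the cross-team acknowledgement (the function name and the example timestamps are hypothetical):

```python
from datetime import datetime


def coordination_latency_minutes(first_alert: datetime, cross_team_ack: datetime) -> float:
    """Minutes between the first alert and the cross-team acknowledgement."""
    if cross_team_ack < first_alert:
        raise ValueError("acknowledgement cannot precede the first alert")
    return (cross_team_ack - first_alert).total_seconds() / 60


# Illustrative timestamps only, not data from a real drill.
latency = coordination_latency_minutes(
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 6, 30),
)
print(f"{latency:.1f} min")  # prints "6.5 min"
```

Comparing this number across reruns of the same drill gives a simple signal of whether playbook changes are holding.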

Typical outputs

  • Playbook pack: Editable templates for each drill with roles, timers, and escalation cues.
  • Facilitation checklist: A run-of-show guide for operators leading the rehearsal.
  • Post-drill log: A lightweight record format for decisions, gaps, and next steps.

Engagement length

  • 1–2 weeks to tailor playbooks to one service or team.
  • 3–5 weeks to adapt multiple drills and align cross-team handoffs.

  1. Pick a playbook that matches your service and stakes.
  2. Assign roles and run through the script, noting gaps in instrumentation or policy.
  3. Capture adjustments and rerun with the updated checklist to confirm the changes hold.

Book a session to tailor the drills to your team and incident history.