Beta
Maintenance Playbooks
Failure-response drills. Timed, role-assigned rehearsals with a script and a timer. When a P1 hits, the team has already practiced the handoff three times.
Start here
Run your first drill
Pick one scenario that maps to your last real incident. Roles, timers, script — run through it with the people closest to response.
- Pick a scenario that reflects your real outage or escalation risks.
- Assign steward roles and timebox each rehearsal segment.
- Capture gaps and update the playbook after the drill.
Playbook structure
Every playbook has four sections: a scenario trigger, a timed role script, a handoff protocol, and a post-drill capture template. No filler.
- Scenario trigger — the exact conditions that start the clock. Example: “Primary on-call does not acknowledge a P1 alert within 3 minutes.”
- Role script — what each named person does and in what order. Timed to the minute.
- Handoff protocol — how command transfers when someone hits their limit or gets pulled into another incident.
- Post-drill log — a one-page template for decisions, gaps, and next steps.
Sample: Steward rotation playbook excerpt
Scenario: Primary on-call is unreachable during a P1 incident. Timer: 8 minutes to establish command. Roles: Incident commander, comms lead, engineering lead. Handoff protocol:
- Secondary assumes command after 8 min without primary acknowledgement.
- Comms lead logs the ownership change in the incident channel.
- Post-incident review captures the gap that caused the miss.
Playbooks currently available
- Steward rotation: Practice handing off live services without losing context or missing recovery cues.
- Pause and rollback: Drill the conditions that justify slowing or reversing a release while protecting affected people.
- Care escalation: Run scenarios that center operator wellbeing and shared responsibility for impact.
Scenario library
- Model drift week: Simulate escalating false positives, rising support volume, and stakeholder pressure to keep shipping.
- Dependency outage: Rehearse what happens when a third-party model or API degrades without warning.
- Policy change deadline: Practice adapting maintenance routines when legal or regulatory obligations shift mid-quarter.
What we measure in a drill
- Coordination latency: Minutes between first alert and cross-team acknowledgement.
- Escalation quality: Did the incident notes include clear owner transitions and decision rationale.
- Recovery burden: Overtime, manual workarounds, and repeat incidents after the drill.
Typical outputs
- Playbook pack: Editable templates for each drill with roles, timers, and escalation cues.
- Facilitation checklist: A run-of-show guide for operators leading the rehearsal.
- Post-drill log: A lightweight record format for decisions, gaps, and next steps.
Engagement length
- 1–2 weeks to tailor playbooks to one service or team.
- 3–5 weeks to adapt multiple drills and align cross-team handoffs.
- Pick a playbook that matches your service and stakes.
- Assign roles and run through the script, noting gaps in instrumentation or policy.
- Capture adjustments and rerun with the updated checklist to confirm the changes hold.
Book a session to tailor the drills to your team and incident play history.