incident-runbook-templates
Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
Category
Document Processing
Incident Runbook Templates
Skills Overview
Incident Runbook Templates provide production-ready incident response runbook templates that cover the full lifecycle—detection, classification, mitigation, resolution, and communication—helping SRE and operations teams quickly establish structured failure-handling procedures.
Use Cases
1. Create an Incident Response Procedure
When a team needs to establish a standardized troubleshooting workflow, this skill helps quickly generate a complete runbook covering detection, initial classification, mitigation steps, verification procedures, and communication templates.
2. Build Service-Specific Playbooks
Create tailored runbooks for specific services (e.g., payment processing, databases, API gateways), including service-specific health check commands, dependency checks, and rollback procedures.
3. On-call Engineer Onboarding Training
Provide structured troubleshooting guidance for newly joined on-call engineers, including incident severity tiers (SEV1–SEV4), an escalation matrix, and communication templates to help them make correct decisions under pressure.
4. Respond to Active Production Incidents
When incidents occur in production, provide immediate step-by-step handling procedures, including actionable commands for quick health checks, service rollback, dependency isolation, and horizontal scaling.
Core Features
1. Incident Severity Tier Standard
Provides a four-level incident classification framework, SEV1 through SEV4. Each level defines the impact scope, response time, and handling priority, helping the team quickly assess how urgent the failure is and allocate the right resources.
| Severity | Impact Scope | Response Time | Typical Scenarios |
|---|---|---|---|
| SEV1 | Full outage, data loss | 15 minutes | Production unavailable |
| SEV2 | Severe functional degradation | 30 minutes | Core functionality fails |
| SEV3 | Minor impact | 2 hours | Non-critical functionality issues |
| SEV4 | Minimal impact | Next workday | UI display problem |
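As a minimal sketch, the response times in the table can be encoded so that tooling computes a response deadline from the detection time. The names `RESPONSE_TARGETS` and `response_deadline` are hypothetical and not part of the skill. "Next workday" is left unresolved because it depends on each team's calendar.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response-time targets copied from the severity table.
# None stands for "next workday", which depends on the team's calendar.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": None,
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a responder should engage, or None for SEV4."""
    target = RESPONSE_TARGETS[severity]  # raises KeyError for unknown levels
    if target is None:
        return None
    return detected_at + target
```

For example, an incident detected at 03:00 and classified as SEV1 would have a response deadline of 03:15.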
2. Service Outage Runbook Template
Provides a complete Kubernetes service incident handling workflow, including real commands for troubleshooting Pod crashes, rolling back deployments, scaling services, and isolating dependencies. The template covers four common failure patterns (full outage, high latency, partial failure, and sudden traffic spikes), each with directly executable diagnostic and mitigation commands.
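One way to keep such commands consistent across services is to generate them from the deployment name and namespace. The sketch below is an illustration under that assumption, not the skill's own tooling. The helpers `rollback_commands` and `scale_command` are hypothetical. They only build argument lists and do not contact a cluster.

```python
from typing import List


def rollback_commands(deployment: str, namespace: str) -> List[List[str]]:
    """Build the undo and status commands for one deployment."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "undo", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


def scale_command(deployment: str, namespace: str, replicas: int) -> List[str]:
    """Build a horizontal scaling command; rejects non-positive replica counts."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` or joined with spaces and pasted into a runbook.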
# Example: quickly roll back to the last stable version
kubectl rollout undo deployment/payment-service -n payments
kubectl rollout status deployment/payment-service -n payments
3. Database Incident Handling Runbook
Provides ad hoc SQL queries and diagnostic commands for common database issues such as connection pool exhaustion, replication lag, and insufficient disk space. These help on-call engineers quickly identify slow queries, terminate blocking connections, evaluate replication lag, and decide whether to perform a failover.
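The three issue types named above can be turned into a first-pass triage check once the raw metrics are available. The sketch below assumes the metrics have already been collected by whatever queries the team uses. The function name and all three thresholds (90% pool usage, 30 seconds of lag, 85% disk usage) are illustrative assumptions, not values from the skill.

```python
from typing import List


def triage_database(
    active_connections: int,
    max_connections: int,
    replication_lag_s: float,
    disk_used_fraction: float,
    pool_warn: float = 0.90,   # assumed threshold
    lag_warn_s: float = 30.0,  # assumed threshold
    disk_warn: float = 0.85,   # assumed threshold
) -> List[str]:
    """Return the issue types whose metric crosses its threshold."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    findings = []
    if active_connections / max_connections >= pool_warn:
        findings.append("connection_pool_exhaustion")
    if replication_lag_s >= lag_warn_s:
        findings.append("replication_lag")
    if disk_used_fraction >= disk_warn:
        findings.append("low_disk_space")
    return findings
```

An empty result means none of the three checks fired. It does not mean the database is healthy, so the runbook's manual diagnostics still apply.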
4. Communication Template Library
Includes three standardized communication template sets: initial incident notification (internal Slack), status updates, and resolution notifications. Each template set contains required fields (e.g., impact scope, current actions, estimated time to resolution) so that critical information is communicated clearly even under pressure.
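To illustrate how required fields can be enforced, the sketch below refuses to render a notification when a field is missing or blank. The template wording, the extra `severity` field, and the function name `render_notification` are assumptions made for this example. Only the three fields named above come from the skill description.

```python
# impact_scope, current_actions, and eta mirror the required fields named above;
# severity is an extra field assumed for this example.
REQUIRED_FIELDS = ("severity", "impact_scope", "current_actions", "eta")

# Hypothetical wording for an initial incident notification.
INITIAL_NOTIFICATION = (
    "[{severity}] Incident declared\n"
    "Impact: {impact_scope}\n"
    "Current actions: {current_actions}\n"
    "Estimated time to resolution: {eta}"
)


def render_notification(template: str, **fields: str) -> str:
    """Fill the template, failing loudly if a required field is absent or blank."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name, "").strip()]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return template.format(**fields)
```

Failing before the message is sent is a deliberate choice here, since an incomplete notification tends to trigger follow-up questions during the incident.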
5. Escalation Matrix Design
Provides a condition-driven escalation decision table that maps factors such as failure duration, impact scope, and whether data security is involved to escalation targets (engineering manager, security team, legal department, etc.) and contact methods.
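A condition-driven table of this kind can be sketched as a list of rules, each pairing a target with a predicate. The specific conditions, the 30-minute threshold, and the contact methods below are illustrative assumptions. The skill's actual matrix may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Incident:
    duration_minutes: int
    customers_affected_pct: float
    data_security_involved: bool


@dataclass(frozen=True)
class EscalationRule:
    target: str
    contact: str
    applies: Callable[[Incident], bool]


# Hypothetical rules; real thresholds and contacts come from the team's policy.
ESCALATION_MATRIX = [
    EscalationRule("engineering manager", "page",
                   lambda i: i.duration_minutes >= 30),
    EscalationRule("security team", "page",
                   lambda i: i.data_security_involved),
    EscalationRule("legal department", "email",
                   lambda i: i.data_security_involved and i.customers_affected_pct > 0),
]


def escalation_targets(incident: Incident) -> List[str]:
    """Return every target whose rule matches, in table order."""
    return [rule.target for rule in ESCALATION_MATRIX if rule.applies(incident)]
```

Keeping the rules in a flat list lets the matrix be reviewed and edited like the table it represents.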
Common Questions
What’s the difference between an incident runbook and a regular operations document?
An incident runbook is designed for high-pressure environments. It emphasizes executable commands and step-by-step decision-making rather than conceptual explanations. A good runbook should allow an engineer who gets called at 3:00 AM to copy-paste and run commands directly, without needing to consult other documents. It includes clear entry conditions (when to use it), verification steps (confirming whether the fix worked), and rollback procedures (what to do if things get worse).
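The three elements named above (entry conditions, verification steps, rollback procedures) can serve as a completeness check on a draft runbook. The sketch below is one possible representation. The `Runbook` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    title: str
    entry_conditions: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    verification: List[str] = field(default_factory=list)
    rollback: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """List the sections that are still empty, in a fixed order."""
        sections = {
            "entry_conditions": self.entry_conditions,
            "steps": self.steps,
            "verification": self.verification,
            "rollback": self.rollback,
        }
        return [name for name, items in sections.items() if not items]
```

A draft that lists only entry conditions and steps would be reported as missing `verification` and `rollback`.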
How is a SEV1 incident determined?
A SEV1 incident typically meets at least one of the following conditions: core business is completely unavailable (e.g., payment functionality fails 100% of the time), there is a risk of data loss or leakage, a large number of paying customers are affected, or there is direct financial loss. Once an incident is classified as SEV1, a response team must be formed within 15 minutes, and management must receive a status update every 15 minutes. If you’re unsure of the severity, err on the side of caution: handle it as SEV1 first, then downgrade based on investigation results.
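The "any of the following" rule and the default for uncertain cases can be written down as a small check. This is a sketch of the decision logic described in the answer above. The function and parameter names are hypothetical.

```python
def is_sev1(
    core_business_down: bool = False,
    data_loss_or_leak_risk: bool = False,
    many_paying_customers_affected: bool = False,
    direct_financial_loss: bool = False,
    unsure: bool = False,
) -> bool:
    """Return True if any SEV1 condition holds, or if severity is unclear."""
    if unsure:
        # Err on the side of caution: treat as SEV1 and downgrade later.
        return True
    return any((
        core_business_down,
        data_loss_or_leak_risk,
        many_paying_customers_affected,
        direct_financial_loss,
    ))
```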
How do I start writing my first runbook?
Start with the most critical services and prioritize issues that have already occurred. Use the “service outage runbook template” provided by this skill as a starting point, filling in service-specific details (e.g., Kubernetes namespace, Pod labels, health check endpoints, dependency services). Then run a game day rehearsal: invite the team to simulate incident handling using the runbook, and record any steps that are unclear or missing information. After each real incident, update the runbook to keep it aligned with the production environment.
Who is this skill for?
What are the limitations of this skill?
This skill provides templates and frameworks, not fully customized solutions for specific systems. You need to replace placeholders in the templates (e.g., [Service Name], [Slack Channel]) with real information, and adjust commands based on your team’s toolchain (Kubernetes, AWS, specific databases). The templates assume you use a standard cloud-native tool stack. If your environment differs significantly, you’ll need to validate command applicability yourself.
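Because leftover placeholders are easy to miss, a small check can flag them before a runbook is published. The sketch below assumes placeholders follow the bracketed, capitalized form of the two examples above. The regular expression and both function names are assumptions. Other bracketed text, such as a tag like `[Draft]`, would also be flagged.

```python
import re
from typing import Dict, List

# Matches bracketed names that start with a capital letter, e.g. [Service Name].
PLACEHOLDER = re.compile(r"\[[A-Z][A-Za-z ]*\]")


def fill_placeholders(template: str, values: Dict[str, str]) -> str:
    """Replace each [Name] with its value; unknown names are left in place."""
    for name, value in values.items():
        template = template.replace(f"[{name}]", value)
    return template


def unresolved_placeholders(text: str) -> List[str]:
    """Return the distinct placeholders still present, sorted alphabetically."""
    return sorted(set(PLACEHOLDER.findall(text)))
```

Running `unresolved_placeholders` on a filled-in template gives a quick list of what still needs real values.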