incident-runbook-templates

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

Install

Download and extract to your skills directory

Copy command and send to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-incident-runbook-templates&locale=en&source=copy

Incident Runbook Templates

Skills Overview


Incident Runbook Templates provides production-ready incident response runbook templates that cover the full lifecycle (detection, classification, mitigation, resolution, and communication), helping SRE and operations teams quickly establish structured failure-handling procedures.

Use Cases

1. Create an Incident Response Procedure


When a team needs to establish a standardized troubleshooting workflow, this skill helps quickly generate a complete runbook covering detection, initial classification, mitigation steps, verification procedures, and communication templates.

2. Build Service-Specific Playbooks


Create tailored runbooks for specific services (e.g., payment processing, databases, API gateways), including service-specific health check commands, dependency checks, and rollback procedures.

3. Onboard and Train On-call Engineers


Provide structured troubleshooting guidance for newly joined on-call engineers, including incident severity tiers (SEV1–SEV4), an escalation matrix, and communication templates to help them make correct decisions under pressure.

4. Respond to Active Production Incidents


When incidents occur in production, provide immediate step-by-step handling procedures, including actionable commands for quick health checks, service rollback, dependency isolation, and horizontal scaling.

Core Features

1. Incident Severity Tier Standard


Provide a four-level SEV1–SEV4 incident classification framework. Each level defines the impact scope, response time, and handling priority, helping the team quickly assess how urgent the failure is and allocate the correct resources.

Severity | Impact Scope                  | Response Time | Typical Scenarios
SEV1     | Full outage, data loss        | 15 minutes    | Production unavailable
SEV2     | Severe functional degradation | 30 minutes    | Core functionality fails
SEV3     | Minor impact                  | 2 hours       | Non-critical functionality issues
SEV4     | Minimal impact                | Next workday  | UI display problem
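
The tier table can also be encoded as a small lookup so that scripts and people agree on the response target. This is a minimal sketch in POSIX shell; the function name `sev_response_time` is hypothetical, and the values simply mirror the table above.

```shell
# Hypothetical helper: map a severity tier to its response-time target.
# Values mirror the SEV1-SEV4 table; the function name is illustrative only.
sev_response_time() {
  case "$1" in
    SEV1) echo "15 minutes" ;;
    SEV2) echo "30 minutes" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next workday" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

sev_response_time SEV2
```

Rejecting unknown tiers with a non-zero exit status keeps a typo from silently producing an empty deadline.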

2. Service Outage Runbook Template


Provide a complete Kubernetes service incident handling workflow, including real commands for troubleshooting Pod crashes, deployment rollbacks, service scaling, and dependency isolation. The template covers four common failure patterns—full outage, high latency, partial failure, and sudden traffic spikes—each with directly executable diagnostic and mitigation commands.

# Example: quickly roll back to the last stable version
kubectl rollout undo deployment/payment-service -n payments
kubectl rollout status deployment/payment-service -n payments
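
Beyond rollback, the pod-crash and scaling steps mentioned above might look roughly like the sketch below. It is not the skill's actual template: the `run` wrapper, the `diagnose_and_scale` function, the `app=` label convention, and the replica count are all assumptions, while the namespace and deployment names are reused from the rollback example. Commands are only printed unless `DRY_RUN=0` is set.

```shell
# Assumptions: the app=<deployment> label and the replica count are hypothetical;
# "payments" and "payment-service" are taken from the rollback example.
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of executing them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

diagnose_and_scale() {
  ns="$1"; deploy="$2"; replicas="$3"
  run kubectl get pods -n "$ns" -l "app=$deploy"                        # pod status
  run kubectl logs "deployment/$deploy" -n "$ns" --previous --tail=100  # logs of the crashed container
  run kubectl scale "deployment/$deploy" -n "$ns" --replicas="$replicas"
  run kubectl rollout status "deployment/$deploy" -n "$ns"              # wait for the new replicas
}

diagnose_and_scale payments payment-service 6
```

Defaulting to dry-run lets an on-call engineer review the exact commands before running them against production.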

3. Database Incident Handling Runbook


For common database issues such as connection pool exhaustion, replication lag, and insufficient disk space, provide ad-hoc SQL queries and diagnostic commands. This helps on-call engineers quickly identify slow queries, terminate blocking connections, evaluate replication lag, and decide whether to perform a failover.
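
As an illustration of such ad-hoc queries, the sketch below assumes PostgreSQL 10 or later; the source does not name a database engine, so treat the engine choice and the function names as assumptions. Each function only prints SQL, so the statement can be reviewed before it is piped to a client.

```shell
# Assumption: PostgreSQL 10+ (the skill description does not name an engine).
# Each function prints SQL so it can be reviewed first, then piped to psql.

# Ten longest-running non-idle sessions.
sql_slow_queries() {
  cat <<'SQL'
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC
LIMIT 10;
SQL
}

# Replay lag per standby, as seen from the primary.
sql_replication_lag() {
  cat <<'SQL'
SELECT client_addr, state, replay_lag
FROM pg_stat_replication;
SQL
}

# Terminate one backend by pid; refuses non-numeric input.
sql_terminate() {
  case "$1" in
    ''|*[!0-9]*) echo "pid must be numeric" >&2; return 1 ;;
  esac
  echo "SELECT pg_terminate_backend($1);"
}

sql_slow_queries
```

A typical use would be `sql_slow_queries | psql "$DATABASE_URL"`, where `DATABASE_URL` is a hypothetical connection string for your environment.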

4. Communication Template Library


Include three standardized communication template sets: initial incident notification (internal Slack), status updates, and resolution notifications. Each template set contains required fields (e.g., impact scope, current actions, estimated time to resolution) to ensure critical information is clearly communicated even under pressure.
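
One way to enforce the required fields is to render the notification from a function that refuses to run when a field is empty. The sketch below is illustrative only: the function name `render_initial_notice` and the message layout are assumptions, not the skill's actual template.

```shell
# Hypothetical renderer for the initial incident notification.
# Field names follow the required fields listed above; the layout is an assumption.
render_initial_notice() {
  sev="$1"; service="$2"; impact="$3"; actions="$4"; eta="$5"
  # Refuse to render if any required field is empty.
  for field in "$sev" "$service" "$impact" "$actions" "$eta"; do
    [ -n "$field" ] || { echo "missing required field" >&2; return 1; }
  done
  printf '[%s] Incident: %s\n' "$sev" "$service"
  printf 'Impact: %s\n' "$impact"
  printf 'Current actions: %s\n' "$actions"
  printf 'Estimated time to resolution: %s\n' "$eta"
}

render_initial_notice SEV2 "payment-service" \
  "Checkout errors for some users" "Rolling back latest deploy" "30 minutes"
```

The output is plain text, so it can be pasted into Slack or passed to whatever notification tool the team already uses.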

5. Escalation Matrix Design


Provide a condition-driven escalation decision table that maps factors such as failure duration, impact scope, and whether data security is involved to escalation targets (engineering manager, security team, legal department, etc.) and their contact methods.
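
A decision table of this kind reduces to a few ordered conditions. In the sketch below, the 60-minute threshold, the rule ordering, and the function name `escalation_target` are placeholders to be replaced with your team's own matrix; only the target names come from the description above.

```shell
# Hypothetical escalation rules; thresholds and ordering are placeholders.
# Args: severity, minutes elapsed, data-security flag (yes/no)
escalation_target() {
  sev="$1"; minutes="$2"; data_risk="$3"
  if [ "$data_risk" = "yes" ]; then
    # Data security involvement takes priority over every other rule.
    echo "security team + legal department"
  elif [ "$sev" = "SEV1" ] || [ "$minutes" -ge 60 ]; then
    echo "engineering manager"
  else
    echo "on-call engineer"
  fi
}

escalation_target SEV2 75 no
```

Checking the data-security condition first means a long-running but otherwise minor incident still reaches the security team when data is involved.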

Common Questions

What’s the difference between an incident runbook and a regular operations document?


An incident runbook is designed for high-pressure environments. It emphasizes executable commands and step-by-step decision-making rather than conceptual explanations. A good runbook should allow an engineer who gets called at 3:00 AM to copy-paste and run commands directly, without needing to consult other documents. It includes clear entry conditions (when to use), verification steps (confirming whether it worked), and rollback procedures (if things get worse).

How is a SEV1 incident determined?


A SEV1 incident typically meets at least one of the following conditions: core business functionality is completely unavailable (e.g., payment functionality fails 100% of the time), there is a risk of data loss or leakage, a large number of paying customers are affected, or there is direct financial loss. Once an incident is declared SEV1, a response team must be formed within 15 minutes, and status updates must be sent to management every 15 minutes. If you’re unsure of the severity, err on the side of caution: handle it as SEV1 first, then downgrade based on investigation results.
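
The "any condition, or any doubt, means SEV1" rule can be written down as a short check. This is a sketch, not part of the skill itself; the function name `is_sev1` and the yes/no/unsure answer format are assumptions.

```shell
# Sketch of the "any condition => SEV1" rule. Each argument is yes, no, or unsure.
# Prints SEV1 and returns 0 when the incident should be handled as SEV1.
is_sev1() {
  for answer in "$@"; do
    case "$answer" in
      # "unsure" is treated like "yes": err on the side of caution.
      yes|unsure) echo "SEV1"; return 0 ;;
    esac
  done
  echo "not SEV1"
  return 1
}

# Order: core outage? data loss or leak risk? many paying customers? direct financial loss?
is_sev1 no unsure no no
```

The call above prints `SEV1`, because a single uncertain answer is enough to start at the highest tier and downgrade later.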

How do I start writing my first runbook?


Start with the most critical services and prioritize issues that have already occurred. Use the “service outage runbook template” provided by this skill as a starting point, filling in service-specific details (e.g., Kubernetes namespace, Pod labels, health check endpoints, dependency services). Then run a game day rehearsal: invite the team to simulate incident handling using the runbook, and record any steps that are unclear or missing information. After each real incident, update the runbook to keep it aligned with the production environment.

Who is this skill for?


  • SRE/Operations Engineers: build standardized incident handling workflows for the team

  • Platform Engineers: write customized playbooks for core services

  • On-call Engineers: use as a quick reference guide for incident response

  • Engineering Managers: design team escalation matrices and communication standards

  • DevOps Engineers: integrate runbooks into automation tools (e.g., PagerDuty, Opsgenie)

What are the limitations of this skill?


This skill provides templates and frameworks, not fully customized solutions for specific systems. You need to replace placeholders in the templates (e.g., [Service Name], [Slack Channel]) with real information, and adjust commands based on your team’s toolchain (Kubernetes, AWS, specific databases). The templates assume you use a standard cloud-native tool stack. If your environment differs significantly, you’ll need to validate command applicability yourself.
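
Placeholder replacement can be scripted rather than done by hand. The sketch below assumes templates are plain text that use exactly the bracketed placeholders named above; the function name `fill_placeholders` and the sample values are hypothetical.

```shell
# Hypothetical helper: replace [Service Name] and [Slack Channel] placeholders
# in template text read from stdin.
# Caveat: values must not contain "|", "&", or "\" because they are passed to sed.
fill_placeholders() {
  service="$1"; channel="$2"
  sed -e "s|\[Service Name\]|$service|g" \
      -e "s|\[Slack Channel\]|$channel|g"
}

echo "Incident in [Service Name], follow updates in [Slack Channel]" |
  fill_placeholders "payment-service" "#incident-payments"
```

Running `grep -n '\[' runbook.txt` afterwards (with `runbook.txt` standing in for your own file) is a quick way to spot placeholders that were missed.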
