Incident Runbook Templates

Skill Overview

Incident Runbook Templates provides production-ready incident response runbooks that cover the full lifecycle (detection, classification, mitigation, resolution, and communication), helping SRE and operations teams quickly establish structured failure-handling procedures.

Use Cases

1. Create an Incident Response Procedure

When a team needs to establish a standardized troubleshooting workflow, this skill helps quickly generate a complete runbook covering detection, initial classification, mitigation steps, verification procedures, and communication templates.

2. Build Service-Specific Playbooks

Create tailored runbooks for specific services (e.g., payment processing, databases, API gateways), including service-specific health check commands, dependency checks, and rollback procedures.

3. Onboard On-call Engineers

Provide structured troubleshooting guidance for newly joined on-call engineers, including incident severity tiers (SEV1–SEV4), an escalation matrix, and communication templates to help them make correct decisions under pressure.

4. Respond to Active Production Incidents

When incidents occur in production, provide immediate step-by-step handling procedures, including actionable commands for quick health checks, service rollback, dependency isolation, and horizontal scaling.

Core Features

1. Incident Severity Tier Standard

Provides a four-level incident classification framework (SEV1–SEV4). Each level defines the impact scope, response time, and handling priority, helping the team quickly assess the urgency of a failure and allocate the right resources.

Severity	Impact Scope	Response Time	Typical Scenarios
SEV1	Full outage, data loss	15 minutes	Production unavailable
SEV2	Severe functional degradation	30 minutes	Core functionality fails
SEV3	Minor impact	2 hours	Non-critical functionality issues
SEV4	Minimal impact	Next workday	UI display problem
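
The table above can also be kept as data so that tooling can compute a response deadline. The following is a minimal sketch, not part of the skill itself: the names `SeverityTier`, `TIERS`, and `respond_by` are hypothetical, and "next workday" is modeled as "no fixed deadline" because the source does not define working hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityTier:
    name: str
    impact: str
    response_time: Optional[timedelta]  # None stands for "next workday"


# Values copied from the severity table above.
TIERS = {
    "SEV1": SeverityTier("SEV1", "Full outage, data loss", timedelta(minutes=15)),
    "SEV2": SeverityTier("SEV2", "Severe functional degradation", timedelta(minutes=30)),
    "SEV3": SeverityTier("SEV3", "Minor impact", timedelta(hours=2)),
    "SEV4": SeverityTier("SEV4", "Minimal impact", None),
}


def respond_by(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for next-workday tiers."""
    tier = TIERS[severity]
    if tier.response_time is None:
        return None
    return detected_at + tier.response_time
```

For example, `respond_by("SEV1", datetime(2024, 1, 1, 3, 0))` gives 03:15 on the same day.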

2. Service Outage Runbook Template

Provides a complete Kubernetes service incident handling workflow, including real commands for troubleshooting Pod crashes, rolling back deployments, scaling services, and isolating dependencies. The template covers four common failure patterns (full outage, high latency, partial failure, and sudden traffic spikes), each with directly executable diagnostic and mitigation commands.

# Example: quickly roll back to the previous revision
kubectl rollout undo deployment/payment-service -n payments
kubectl rollout status deployment/payment-service -n payments
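
When a runbook is generated for several services, the same commands can be built from a service name and namespace instead of being typed by hand. This is a rough sketch under that assumption: the helper names (`rollback_commands`, `scale_command`, `render`) are hypothetical, and the code only builds and prints the command lines; it does not run `kubectl`.

```python
import shlex
from typing import List


def rollback_commands(deployment: str, namespace: str) -> List[List[str]]:
    """Build the rollback command and the follow-up rollout status check."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "undo", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


def scale_command(deployment: str, namespace: str, replicas: int) -> List[str]:
    """Build a horizontal scaling command."""
    # Guard chosen for this sketch: avoid scaling to zero by accident mid-incident.
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "-n", namespace]


def render(commands: List[List[str]]) -> str:
    """Render commands as copy-pasteable shell lines with safe quoting."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


print(render(rollback_commands("payment-service", "payments")))
```

Building argument lists and quoting them with `shlex.join` keeps a service name containing unusual characters from changing the meaning of the pasted command.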

3. Database Incident Handling Runbook

Provides ready-to-use SQL queries and diagnostic commands for common database issues such as connection pool exhaustion, replication lag, and insufficient disk space. These help on-call engineers quickly identify slow queries, terminate blocking connections, evaluate replication lag, and decide whether to perform a failover.
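
As an illustration of what such queries might look like, the sketch below assumes PostgreSQL; the source does not name a database engine, so treat the engine and the query texts as assumptions. The names `DIAGNOSTIC_QUERIES` and `terminate_backend` are hypothetical, and the `%s` placeholder follows the parameter style of common PostgreSQL drivers.

```python
from typing import Tuple

# Assumption: PostgreSQL. Other engines need different system views and functions.
DIAGNOSTIC_QUERIES = {
    # Longest-running non-idle statements first.
    "slow_queries": """
        SELECT pid, now() - query_start AS duration, state, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY duration DESC
        LIMIT 10;
    """,
    # Connection count by state, to spot connection pool exhaustion.
    "connections_by_state": """
        SELECT state, count(*) AS connections
        FROM pg_stat_activity
        GROUP BY state
        ORDER BY connections DESC;
    """,
    # Run on a replica: time since the last replayed transaction.
    "replication_lag": """
        SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
    """,
}


def terminate_backend(pid: int) -> Tuple[str, Tuple[int]]:
    """Return a parameterized statement that terminates one backend connection."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return "SELECT pg_terminate_backend(%s);", (pid,)
```

Returning the statement and its parameters separately leaves execution to whatever database client the team already uses.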

4. Communication Template Library

Includes three standardized communication template sets: initial incident notification (internal Slack), status updates, and resolution notifications. Each set contains required fields (e.g., impact scope, current actions, estimated time to resolution) to ensure critical information is communicated clearly even under pressure.
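
One way to enforce the required fields is to refuse to render a message while any of them is empty. The sketch below is illustrative only: the message wording, the field names, and the function `render_notification` are assumptions, not the skill's actual templates.

```python
from string import Template

# Illustrative wording; only the idea of required fields comes from the text above.
INITIAL_NOTIFICATION = Template(
    "[$severity] $service incident\n"
    "Impact: $impact_scope\n"
    "Current actions: $current_actions\n"
    "Estimated time to resolution: $eta"
)

REQUIRED_FIELDS = ("severity", "service", "impact_scope", "current_actions", "eta")


def render_notification(**fields: str) -> str:
    """Render the initial notification, rejecting missing or blank fields."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return INITIAL_NOTIFICATION.substitute(fields)
```

Failing loudly on a blank field is a deliberate choice here: a notification without an impact statement is easy to send under pressure and hard to correct afterwards.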

5. Escalation Matrix Design

Provides a condition-driven escalation decision table that maps factors such as failure duration, impact scope, and whether data security is involved to escalation targets (engineering manager, security team, legal department, etc.) and contact methods.
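
A condition-driven table of this kind can be expressed as a list of rules, each pairing a target and contact method with a predicate. This is a minimal sketch: the thresholds, contact methods, and all names (`Incident`, `Rule`, `RULES`, `escalation_targets`) are hypothetical and should be replaced with your team's own values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Incident:
    duration_minutes: int
    customers_affected: int
    data_security_involved: bool


@dataclass(frozen=True)
class Rule:
    target: str
    contact: str
    applies: Callable[[Incident], bool]


# Hypothetical thresholds and contact methods, for illustration only.
RULES = [
    Rule("engineering manager", "page", lambda i: i.duration_minutes >= 30),
    Rule("security team", "page", lambda i: i.data_security_involved),
    Rule("legal department", "email",
         lambda i: i.data_security_involved and i.customers_affected > 0),
]


def escalation_targets(incident: Incident) -> List[str]:
    """Return every escalation target whose condition matches the incident."""
    return [f"{r.target} ({r.contact})" for r in RULES if r.applies(incident)]
```

Keeping the rules in one list makes the matrix easy to review after an incident and to extend without touching the matching logic.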

Common Questions

What’s the difference between an incident runbook and a regular operations document?

An incident runbook is designed for high-pressure environments. It emphasizes executable commands and step-by-step decision-making rather than conceptual explanations. A good runbook should allow an engineer who gets called at 3:00 AM to copy-paste and run commands directly, without needing to consult other documents. It includes clear entry conditions (when to use it), verification steps (how to confirm it worked), and rollback procedures (what to do if things get worse).

How is a SEV1 incident determined?

A SEV1 incident typically meets any of the following conditions: core business is completely unavailable (e.g., payment functionality fails 100%), there is a risk of data loss or leakage, a large number of paying customers are affected, or there is direct financial loss. Once an incident is classified as SEV1, a response team must be formed within 15 minutes, and a status update must be sent to management every 15 minutes. If you’re unsure of the severity, err on the side of caution: handle it as SEV1 first, then downgrade based on investigation results.
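
The "any condition, and SEV1 when unsure" rule above can be written down as a small check. This is a sketch under the assumption that each condition is answered as yes, no, or not yet known; the function name `treat_as_sev1` and its parameter names are hypothetical.

```python
from typing import Optional


def treat_as_sev1(
    core_business_down: Optional[bool],
    data_loss_or_leak_risk: Optional[bool],
    many_paying_customers_affected: Optional[bool],
    direct_financial_loss: Optional[bool],
) -> bool:
    """Return True if the incident should be handled as SEV1.

    Each argument is True, False, or None (not yet known).
    """
    answers = (
        core_business_down,
        data_loss_or_leak_risk,
        many_paying_customers_affected,
        direct_financial_loss,
    )
    # Any confirmed condition, or any unknown one, means SEV1 for now;
    # the incident can be downgraded once every answer is a confirmed False.
    return any(answer is not False for answer in answers)
```

Only a full set of confirmed "no" answers returns `False`, which mirrors the advice to start at SEV1 and downgrade after investigation.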

How do I start writing my first runbook?

Start with the most critical services and prioritize issues that have already occurred. Use the “service outage runbook template” provided by this skill as a starting point, filling in service-specific details (e.g., Kubernetes namespace, Pod labels, health check endpoints, dependency services). Then run a game day rehearsal: invite the team to simulate incident handling using the runbook, and record any steps that are unclear or missing information. After each real incident, update the runbook to keep it aligned with the production environment.

Who is this skill for?

SRE/Operations Engineers: build standardized incident handling workflows for the team

Platform Engineers: write customized playbooks for core services

On-call Engineers: use the runbooks as a quick reference guide for incident response

Engineering Managers: design team escalation matrices and communication standards

DevOps Engineers: integrate runbooks into automation tools (e.g., PagerDuty, Opsgenie)

What are the limitations of this skill?

This skill provides templates and frameworks, not fully customized solutions for specific systems. You need to replace placeholders in the templates (e.g., [Service Name], [Slack Channel]) with real information, and adjust commands based on your team’s toolchain (Kubernetes, AWS, specific databases). The templates assume you use a standard cloud-native tool stack. If your environment differs significantly, you’ll need to validate command applicability yourself.
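
Because unreplaced placeholders are easy to miss, a quick scan before publishing a runbook can help. The sketch below assumes placeholders look like the examples in the text, that is, capitalized words in square brackets; the pattern is a heuristic and the name `unfilled_placeholders` is hypothetical.

```python
import re
from typing import List

# Heuristic: bracketed, capitalized words such as [Service Name] or [Slack Channel].
PLACEHOLDER = re.compile(r"\[[A-Z][A-Za-z ]*\]")


def unfilled_placeholders(runbook_text: str) -> List[str]:
    """Return the distinct placeholders still present in the text, sorted."""
    return sorted(set(PLACEHOLDER.findall(runbook_text)))


draft = "Post in [Slack Channel] when [Service Name] is down. Check [Service Name] health."
print(unfilled_placeholders(draft))
```

The pattern may also match legitimate bracketed text, so treat the result as a list to review rather than a hard failure.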
