incident-runbook-templates

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.


# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## Do not use this skill when

- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.

## Use this skill when

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Runbook Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

- **Service:** Payment Processing Service
- **Owner:** Platform Team
- **Slack:** #payments-incidents
- **PagerDuty:** payments-oncall

## Impact Assessment


- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection


### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
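
These thresholds could be encoded as Prometheus alerting rules. A minimal sketch follows; the alert names, file name, severity labels, and the assumption that `payment_error_rate` is a 0–1 ratio are all hypothetical — only the metrics and thresholds come from the list above:

```bash
# Hypothetical rule file matching the alert thresholds listed above.
cat > payment-alerts.yml <<'EOF'
groups:
- name: payments
  rules:
  - alert: PaymentErrorRateHigh
    expr: payment_error_rate > 0.05   # assumes the metric is a ratio
    for: 5m
    labels:
      severity: page                  # routed to PagerDuty
  - alert: PaymentLatencyP99High
    expr: payment_latency_p99 > 2     # seconds
    for: 5m
    labels:
      severity: warn                  # routed to Slack
EOF

# Validate the rule file before deploying it
promtool check rules payment-alerts.yml
```
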
### Dashboards

- Payment Service Dashboard
- Error Tracking
- Dependency Status

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification


| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
psql -h $DB_HOST -U $DB_USER -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds';"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
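
Step 4 assumes a `curl-format.txt` file exists on the host. A minimal sketch of what it might contain, using curl's standard `--write-out` variables (the file itself is not shown in this runbook):

```bash
# Hypothetical timing template consumed by: curl -w "@curl-format.txt"
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s
time_connect:       %{time_connect}s
time_appconnect:    %{time_appconnect}s
time_starttransfer: %{time_starttransfer}s
time_total:         %{time_total}s
EOF
```
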
### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
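
The smoke-test script is referenced but not shown. A minimal sketch of what `scripts/smoke-test-payments.sh` might look like; the endpoints and expected status codes are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical smoke test for critical payment flows; endpoints are assumptions.
set -euo pipefail

BASE_URL="${BASE_URL:-https://api.company.com/payments}"

check() {
  local name="$1" url="$2" expected="$3"
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [[ "$code" == "$expected" ]]; then
    echo "PASS: $name ($code)"
  else
    echo "FAIL: $name (expected $expected, got $code)"
    exit 1
  fi
}

check "health endpoint" "$BASE_URL/health"  200
check "payment methods" "$BASE_URL/methods" 200
echo "All smoke tests passed."
```
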
## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
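
If failover is the last resort, promotion on a PostgreSQL 12+ replica can be done with `pg_promote()`. A minimal sketch assuming direct superuser access to the replica; in practice your HA tooling (e.g. Patroni) should drive this, and the host variables below are assumptions:

```bash
# Promote the replica to primary. This is one-way: the old primary must be
# re-cloned or pg_rewind'ed before it can rejoin as a replica.
psql -h "$REPLICA_HOST" -U "$DB_USER" -c "SELECT pg_promote();"

# Verify promotion completed (returns 'f' once the node is out of recovery)
psql -h "$REPLICA_HOST" -U "$DB_USER" -c "SELECT pg_is_in_recovery();"
```
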

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space (note: VACUUM FULL takes an exclusive lock on the table)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
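
The last step is environment-specific. If the database runs on Kubernetes with an expandable StorageClass, the volume can be grown in place; a sketch where the PVC name and target size are assumptions:

```bash
# Grow the database volume in place (requires allowVolumeExpansion: true
# on the StorageClass; PVC name and size are hypothetical).
kubectl patch pvc postgres-data -n payments \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Watch the resize complete
kubectl get pvc postgres-data -n payments -w
```
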


## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- Google SRE Book - Incident Management
- PagerDuty Incident Response
- Atlassian Incident Management