incident-runbook-templates
Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery measures. Use when building runbooks, responding to incidents, or establishing incident response procedures.
Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
Use this skill when
Building incident runbooks, responding to an active incident, or establishing incident response procedures.
Instructions
See resources/implementation-playbook.md.
Core Concepts
1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

Runbook Templates
Template 1: Service Outage Runbook
# [Service Name] Outage Runbook

## Overview
Service: Payment Processing Service
Owner: Platform Team
Slack: #payments-incidents
PagerDuty: payments-oncall

## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted? (see the query sketch below)
- [ ] Are there financial implications?
- [ ] What's the blast radius?
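One way to estimate the percentage of impacted traffic is to compare the service's 5xx rate against its total request rate in Prometheus. A minimal sketch, assuming the `http_requests_total` metric already used later in this runbook; the `service="payment-service"` label is illustrative and may differ in your setup:

```bash
# Rough estimate of the % of payment requests currently failing
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=100 * sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="payment-service"}[5m]))' \
  | jq -r '.data.result[0].value[1]'
```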
## Detection

### Alerts
payment_error_rate > 5% (PagerDuty)
payment_latency_p99 > 2s (Slack)
payment_success_rate < 95% (PagerDuty)
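During triage it helps to confirm which of the above alerts are actually firing. A minimal sketch using the standard Prometheus alerts endpoint (the `prometheus:9090` host matches the queries used elsewhere in this runbook):

```bash
# List currently firing alerts and when they started
curl -s "http://prometheus:9090/api/v1/alerts" \
  | jq -r '.data.alerts[] | select(.state == "firing") | "\(.labels.alertname)\t\(.activeAt)"'
```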
### Dashboards
Payment Service Dashboard
Error Tracking
Dependency Status

## Initial Triage (First 5 Minutes)
### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history (a bundled script for these checks is sketched below)
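A minimal sketch that bundles these checks into one pass. The health endpoint and metrics port are the ones referenced elsewhere in this runbook; the `db_pool` metric name and the Stripe status URL check are assumptions:

```bash
#!/usr/bin/env bash
# Quick health sweep for the payment service (assumes kubectl context is already set)
set -uo pipefail

echo "== Service reachable? =="
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://api.company.com/payments/health

echo "== Connection pool metrics (db_pool metric name is illustrative) =="
kubectl exec -n payments deploy/payment-service -- \
  sh -c 'curl -s localhost:8080/metrics | grep db_pool'

echo "== External dependency status (Stripe) =="
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://status.stripe.com

echo "== Recent deploys =="
kubectl rollout history deployment/payment-service -n payments | tail -n 5
```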
### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures
### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency

```bash
# Step 1: Check database connection pool metrics
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (substitute the pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern (group recent errors by message)
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, disable it via feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate and pod resource usage
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
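The smoke-test script is referenced above but not defined in this template. A minimal sketch of what such a script might check; the base URL, charges endpoint, and test payload are purely illustrative:

```bash
#!/usr/bin/env bash
# scripts/smoke-test-payments.sh (illustrative sketch, not the real script)
set -euo pipefail

BASE_URL="${BASE_URL:-https://api.company.com/payments}"

# 1. Health endpoint answers 200
curl -fsS "$BASE_URL/health" > /dev/null && echo "health: OK"

# 2. A test-mode charge succeeds end to end (endpoint and payload are hypothetical)
http_code=$(curl -s -o /dev/null -w "%{http_code}" -X POST "$BASE_URL/charges" \
  -H "Content-Type: application/json" \
  -d '{"amount": 100, "currency": "usd", "source": "test_card", "test_mode": true}')
[ "$http_code" = "201" ] && echo "test charge: OK" || { echo "test charge failed ($http_code)"; exit 1; }
```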
## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix

| Condition | Escalate To | Contact |
|---|---|---|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
## Communication Templates

### Initial Notification (Internal)
🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
Updates in #payments-incidents
### Status Update

📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
Next Steps:
ETA to Resolution: ~15 minutes
### Resolution Notification

✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
Follow-up:
Template 2: Database Incident Runbook
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|---|---|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```
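To confirm the pool is genuinely exhausted rather than the application misreporting, compare current connections against the server-side limit. A small sketch using standard PostgreSQL settings (connection variables match the ones used earlier in this document):

```bash
# Compare current connections against max_connections (headroom check)
psql -h $DB_HOST -U $DB_USER -c "
SELECT (SELECT count(*) FROM pg_stat_activity) AS current_connections,
       current_setting('max_connections')::int AS max_connections;"
```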
## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary and replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
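If lag keeps growing, a quick way to check replica disk pressure is iostat, and as a last resort, once the primary is confirmed unrecoverable, the replica can be promoted. A hedged sketch: `pg_promote()` requires PostgreSQL 12+ (older versions use `pg_ctl promote`), and `$REPLICA_HOST` is illustrative:

```bash
# Check replica disk I/O saturation (run on or against the replica host)
iostat -dx 5 3

# LAST RESORT: promote the replica to primary (PostgreSQL 12+ only;
# fence off the old primary first to avoid split-brain)
psql -h $REPLICA_HOST -U $DB_USER -c "SELECT pg_promote();"
```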
## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM FULL to reclaim space (takes an exclusive lock on the table)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
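For the emergency path, two common options are expanding the volume (only works if the storage class allows volume expansion) or trimming old rows in batches. The PVC name, namespace, retention table, and window below are illustrative:

```bash
# Option A: expand the PostgreSQL volume
# (requires allowVolumeExpansion on the storage class; PVC name/namespace are illustrative)
kubectl patch pvc postgres-data -n payments \
  -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'

# Option B: trim old rows in small batches to avoid long locks
# (table and retention window are illustrative)
psql -c "DELETE FROM audit_log
WHERE ctid IN (SELECT ctid FROM audit_log
               WHERE created_at < now() - interval '90 days'
               LIMIT 10000);"
```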