incident-runbook-templates

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.


# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## Do not use this skill when

- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.

## Use this skill when

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Runbook Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

- **Service:** Payment Processing Service
- **Owner:** Platform Team
- **Slack:** #payments-incidents
- **PagerDuty:** payments-oncall

## Impact Assessment


- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection


### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
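
These thresholds could be encoded as Prometheus alerting rules. A minimal sketch follows; the alert names, file name, severity labels, and the assumption that `payment_error_rate` is a 0–1 ratio are all hypothetical — only the metrics and thresholds come from the list above:

```bash
# Hypothetical rule file matching the alert thresholds listed above.
cat > payment-alerts.yml <<'EOF'
groups:
- name: payments
  rules:
  - alert: PaymentErrorRateHigh
    expr: payment_error_rate > 0.05   # assumes the metric is a ratio
    for: 5m
    labels:
      severity: page                  # routed to PagerDuty
  - alert: PaymentLatencyP99High
    expr: payment_latency_p99 > 2     # seconds
    for: 5m
    labels:
      severity: warn                  # routed to Slack
EOF

# Validate the rule file before deploying it
promtool check rules payment-alerts.yml
```
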
### Dashboards

- Payment Service Dashboard
- Error Tracking
- Dependency Status

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Checks
- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification


| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
psql -h $DB_HOST -U $DB_USER -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds';"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
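
Step 4 assumes a `curl-format.txt` file exists on the host. A minimal sketch of what it might contain, using curl's standard `--write-out` variables (the file itself is not shown in this runbook):

```bash
# Hypothetical timing template consumed by: curl -w "@curl-format.txt"
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s
time_connect:       %{time_connect}s
time_appconnect:    %{time_appconnect}s
time_starttransfer: %{time_starttransfer}s
time_total:         %{time_total}s
EOF
```
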
### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
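
The smoke-test script is referenced but not shown. A minimal sketch of what `scripts/smoke-test-payments.sh` might look like; the endpoints and expected status codes are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical smoke test for critical payment flows; endpoints are assumptions.
set -euo pipefail

BASE_URL="${BASE_URL:-https://api.company.com/payments}"

check() {
  local name="$1" url="$2" expected="$3"
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [[ "$code" == "$expected" ]]; then
    echo "PASS: $name ($code)"
  else
    echo "FAIL: $name (expected $expected, got $code)"
    exit 1
  fi
}

check "health endpoint" "$BASE_URL/health"  200
check "payment methods" "$BASE_URL/methods" 200
echo "All smoke tests passed."
```
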
## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```

### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
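
If failover is the last resort, promotion on a PostgreSQL 12+ replica can be done with `pg_promote()`. A minimal sketch assuming direct superuser access to the replica; in practice your HA tooling (e.g. Patroni) should drive this, and the host variables below are assumptions:

```bash
# Promote the replica to primary. This is one-way: the old primary must be
# re-cloned or pg_rewind'ed before it can rejoin as a replica.
psql -h "$REPLICA_HOST" -U "$DB_USER" -c "SELECT pg_promote();"

# Verify promotion completed (returns 'f' once the node is out of recovery)
psql -h "$REPLICA_HOST" -U "$DB_USER" -c "SELECT pg_is_in_recovery();"
```
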

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space (note: VACUUM FULL takes an exclusive lock on the table)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
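
The last step is environment-specific. If the database runs on Kubernetes with an expandable StorageClass, the volume can be grown in place; a sketch where the PVC name and target size are assumptions:

```bash
# Grow the database volume in place (requires allowVolumeExpansion: true
# on the StorageClass; PVC name and size are hypothetical).
kubectl patch pvc postgres-data -n payments \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Watch the resize complete
kubectl get pvc postgres-data -n payments -w
```
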


## Best Practices

### Do's

- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts

- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- Google SRE Book - Incident Management
- PagerDuty Incident Response
- Atlassian Incident Management