on-call-handoff-patterns
Master the essentials of on-call handoffs: context transfer, escalation paths, and documentation. Use when handing off on-call duties, writing shift summaries, or improving on-call processes.
On-Call Handoff Patterns
Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.
Use this skill when handing off on-call duties, writing shift summaries, or improving on-call processes.

Instructions

See resources/implementation-playbook.md.
Core Concepts
1. Handoff Components
| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
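The component table above can also be captured as a small data structure, so the handoff document is generated rather than rewritten from scratch each shift. A minimal Python sketch — the class, field names, and output format are illustrative, not a required schema:

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """One shift's handoff, mirroring the component table above."""
    active_incidents: list[str] = field(default_factory=list)
    investigations: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)
    known_issues: list[str] = field(default_factory=list)
    upcoming_events: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        sections = [
            ("Active Incidents", self.active_incidents),
            ("Ongoing Investigations", self.investigations),
            ("Recent Changes", self.recent_changes),
            ("Known Issues", self.known_issues),
            ("Upcoming Events", self.upcoming_events),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"## {title}")
            if items:
                lines.extend(f"- {item}" for item in items)
            else:
                # State empty sections explicitly so gaps are visible.
                lines.append("- None")
            lines.append("")
        return "\n".join(lines)
```

Generating the skeleton this way guarantees no section is silently skipped — an empty category renders as an explicit "None" rather than disappearing.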
2. Handoff Timing
Recommended: 30 min overlap between shifts.

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming
Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
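The "verify alerting setup" step can be smoke-tested by sending a deliberately low-severity test event through the PagerDuty Events API v2 and confirming the page reaches the incoming engineer. A hedged sketch — the routing key comes from your service's integration, the `PD_ROUTING_KEY` environment variable is an assumption of this example, and nothing is sent unless it is set:

```python
import json
import os
import urllib.request

# Public PagerDuty Events API v2 enqueue endpoint.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_test_event(routing_key: str, who: str) -> dict:
    """Payload for a deliberately low-severity test page."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"TEST PAGE: on-call handoff check for {who}",
            "source": "handoff-checklist",
            "severity": "info",
        },
    }


def send(event: dict) -> int:
    """POST the event; returns the HTTP status code."""
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    key = os.environ.get("PD_ROUTING_KEY")  # set only when you really want a page
    event = build_test_event(key or "dummy", "@bob")
    if key:
        print(send(event))
```

Sending (and acknowledging) a real test page is the only way to prove the whole chain — schedule, routing, phone notifications — actually works.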
Templates
Template 1: Shift Handoff Document
# On-Call Handoff: Platform Team

Outgoing: @alice (2024-01-15 to 2024-01-22)
Incoming: @bob (2024-01-22 to 2024-01-29)
Handoff Time: 2024-01-22 09:00 UTC
🔴 Active Incidents
No active incidents at handoff time.
🟡 Ongoing Investigations
1. Intermittent API Timeouts (ENG-1234)
Status: Investigating
Started: 2024-01-20
Impact: ~0.1% of requests timing out

Context:
Timeouts correlate with database backup window (02:00-03:00 UTC)
Suspect backup process causing lock contention
Added extra logging in PR #567 (deployed 01/21)

Next Steps:
[ ] Review new logs after tonight's backup
[ ] Consider moving backup window if confirmed

Resources:
Dashboard: API Latency
Thread: #platform-eng (01/20, 14:32)
2. Memory Growth in Auth Service (ENG-1235)
Status: Monitoring
Started: 2024-01-18
Impact: None yet (proactive)

Context:
Memory usage growing ~5% per day
No memory leak found in profiling
Suspect connection pool not releasing properly

Next Steps:
[ ] Review heap dump from 01/21
[ ] Consider restart if usage > 80%

Resources:
Dashboard: Auth Service Memory
Analysis doc: Memory Investigation
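The "restart if usage > 80%" rule in this investigation is worth encoding so two shifts apply it identically. A minimal sketch — the 80% threshold and the ~5%/day growth rate are the numbers from this handoff, not universal defaults:

```python
def should_restart(usage_pct: float, threshold: float = 80.0) -> bool:
    """Restart criterion agreed in the handoff: act at the threshold, not after."""
    return usage_pct >= threshold


def hours_until_threshold(usage_pct: float,
                          daily_growth_pct: float = 5.0,
                          threshold: float = 80.0) -> float:
    """Rough ETA, assuming the linear ~5%/day growth observed so far."""
    if usage_pct >= threshold:
        return 0.0
    return (threshold - usage_pct) / daily_growth_pct * 24.0
```

For example, at 75% usage the estimate is about a day of headroom — enough to wait for the 01/21 heap dump review before restarting.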
🟢 Resolved This Shift
Payment Service Outage (2024-01-19)
Duration: 23 minutes
Root Cause: Database connection exhaustion
Resolution: Rolled back v2.3.4, increased pool size
Postmortem: POSTMORTEM-89
Follow-up tickets: ENG-1230, ENG-1231
📋 Recent Changes
Deployments
| Service | Version | Time | Notes |
|---|---|---|---|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |

Configuration Changes
01/21: Increased API rate limit from 1000 to 1500 RPS
01/20: Updated database connection pool max from 50 to 75

Infrastructure
01/20: Added 2 nodes to Kubernetes cluster
01/19: Upgraded Redis from 6.2 to 7.0
⚠️ Known Issues & Workarounds
1. Slow Dashboard Loading
Issue: Grafana dashboards slow on Monday mornings
Workaround: Wait 5 min after 08:00 UTC for cache warm-up
Ticket: OPS-456 (P3)

2. Flaky Integration Test
Issue: test_payment_flow fails intermittently in CI
Workaround: Re-run failed job (usually passes on retry)
Ticket: ENG-1200 (P2)
📅 Upcoming Events
| Date | Event | Impact | Contact |
|---|---|---|---|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |
📞 Escalation Reminders
| Issue Type | First Escalation | Second Escalation |
|---|---|---|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |
🔧 Quick Reference
Common Commands
```bash
# Check service health
kubectl get pods -A | grep -v Running

# Recent deployments
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only)
redis-cli FLUSHDB
```
Important Links
Runbooks
Service Catalog
Incident Slack
PagerDuty
Handoff Checklist
Outgoing Engineer
[x] Document active incidents
[x] Document ongoing investigations
[x] List recent changes
[x] Note known issues
[x] Add upcoming events
[x] Sync with incoming engineer

Incoming Engineer
[ ] Read this document
[ ] Join sync call
[ ] Verify PagerDuty is routing to you
[ ] Verify Slack notifications working
[ ] Check VPN/access working
[ ] Review critical dashboards

Template 2: Quick Handoff (Async)
# Quick Handoff: @alice → @bob

TL;DR
No active incidents
1 investigation ongoing (API timeouts, see ENG-1234)
Major release tomorrow (01/24) - be ready for issues

Watch List
API latency around 02:00-03:00 UTC (backup window)
Auth service memory (restart if > 80%)

Recent
Deployed api-gateway v3.2.1 yesterday (stable)
Increased rate limits to 1500 RPS

Coming Up
01/23 02:00 - DB maintenance (5 min read-only)
01/24 14:00 - v5.0 release Questions?
I'll be available on Slack until 17:00 today.Template 3: Incident Handoff (Mid-Incident)
# INCIDENT HANDOFF: Payment Service Degradation

Incident Start: 2024-01-22 08:15 UTC
Current Status: Mitigating
Severity: SEV2
Current State
Error rate: 15% (down from 40%)
Mitigation in progress: scaling up pods
ETA to resolution: ~30 min

What We Know
Root cause: Memory pressure on payment-service pods
Triggered by: Unusual traffic spike (3x normal)
Contributing: Inefficient query in checkout flow

What We've Done
Scaled payment-service from 5 → 15 pods
Enabled rate limiting on checkout endpoint
Disabled non-critical features

What Needs to Happen
Monitor error rate - should reach <1% in ~15 min
If not improving, escalate to @payments-manager
Once stable, begin root cause investigation

Key People
Incident Commander: @alice (handing off)
Comms Lead: @charlie
Technical Lead: @bob (incoming)

Communication
Status page: Updated at 08:45
Customer support: Notified
Exec team: Aware

Resources
Incident channel: #inc-20240122-payment
Dashboard: Payment Service
Runbook: Payment Degradation
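The "monitor, and escalate if not improving" step above can be made mechanical by comparing recent error-rate samples against the target and the expected downward trend. A hedged sketch — the 1% target comes from this handoff, and the trend rule (newest sample must be the lowest seen) is one illustrative choice:

```python
def next_action(samples: list[float], target: float = 1.0) -> str:
    """Decide the next move from recent error-rate samples (percent, oldest first).

    Returns one of: "resolve-monitoring", "keep-monitoring", "escalate".
    """
    if not samples:
        return "keep-monitoring"
    latest = samples[-1]
    if latest < target:
        # Below target: mitigation worked; start root cause investigation.
        return "resolve-monitoring"
    # Improving means the newest sample is the lowest we've seen.
    if latest <= min(samples):
        return "keep-monitoring"
    # Above target and trending the wrong way: escalate per the handoff.
    return "escalate"
```

With the numbers in this incident, samples of 40% → 15% → 12% keep monitoring, while any rebound (15% → 18%) triggers the escalation to @payments-manager.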
Incoming on-call (@bob) - Please confirm you have:
[ ] Joined #inc-20240122-payment
[ ] Access to dashboards
[ ] Understand current state
[ ] Know escalation path

Handoff Sync Meeting
Agenda (15 minutes)
## Handoff Sync: @alice → @bob

Active Issues (5 min)
- Walk through any ongoing incidents
- Discuss investigation status
- Transfer context and theories

Recent Changes (3 min)
- Deployments to watch
- Config changes
- Known regressions

Upcoming Events (3 min)
- Maintenance windows
- Expected traffic changes
- Releases planned

Questions (4 min)
- Clarify anything unclear
- Confirm access and alerting
- Exchange contact info

On-Call Best Practices
Before Your Shift
## Pre-Shift Checklist

Access Verification
[ ] VPN working
[ ] kubectl access to all clusters
[ ] Database read access
[ ] Log aggregator access (Splunk/Datadog)
[ ] PagerDuty app installed and logged in

Alerting Setup
[ ] PagerDuty schedule shows you as primary
[ ] Phone notifications enabled
[ ] Slack notifications for incident channels
[ ] Test alert received and acknowledged

Knowledge Refresh
[ ] Review recent incidents (past 2 weeks)
[ ] Check service changelog
[ ] Skim critical runbooks
[ ] Know escalation contacts

Environment Ready
[ ] Laptop charged and accessible
[ ] Phone charged
[ ] Quiet space available for calls
[ ] Secondary contact identified (if traveling)

During Your Shift
## Daily On-Call Routine

Morning (start of day)
[ ] Check overnight alerts
[ ] Review dashboards for anomalies
[ ] Check for any P0/P1 tickets created
[ ] Skim incident channels for context

Throughout Day
[ ] Respond to alerts within SLA
[ ] Document investigation progress
[ ] Update team on significant issues
[ ] Triage incoming pages

End of Day
[ ] Hand off any active issues
[ ] Update investigation docs
[ ] Note anything for next shift

After Your Shift
## Post-Shift Checklist

[ ] Complete handoff document
[ ] Sync with incoming on-call
[ ] Verify PagerDuty routing changed
[ ] Close/update investigation tickets
[ ] File postmortems for any incidents
[ ] Take time off if shift was stressful

Escalation Guidelines
When to Escalate
## Escalation Triggers

Immediate Escalation
SEV1 incident declared
Data breach suspected
Unable to diagnose within 30 min
Customer or legal escalation received

Consider Escalation
Issue spans multiple teams
Requires expertise you don't have
Business impact exceeds threshold
You're uncertain about next steps

How to Escalate
Page the appropriate escalation path
Provide brief context in Slack
Stay engaged until escalation acknowledges
Hand off cleanly, don't just disappear
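"Page the appropriate escalation path" is easiest when the routing lives in code rather than memory. A sketch using the contacts from the Escalation Reminders table earlier — replace the mapping with your own teams:

```python
# First and second escalation contacts, mirroring the Escalation Reminders table.
ESCALATION = {
    "payment": ("@payments-oncall", "@payments-manager"),
    "auth": ("@auth-oncall", "@security-team"),
    "database": ("@dba-team", "@infra-manager"),
}

# Fallback for unknown or severe issues.
DEFAULT = ("@engineering-manager", "@vp-engineering")


def escalation_path(issue_type: str, level: int = 1) -> str:
    """Return the contact for the given escalation level (1 or 2)."""
    first, second = ESCALATION.get(issue_type.lower(), DEFAULT)
    return first if level == 1 else second
```

Keeping this lookup next to the paging tooling means an uncertain on-call engineer never has to guess who owns an issue type at 3 a.m.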