postmortem-writing - Agent Skills

Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

Do not use this skill when

The task is unrelated to postmortem writing

You need a different domain or tool outside this scope

Instructions

Clarify goals, constraints, and required inputs.

Apply relevant best practices and validate outcomes.

Provide actionable steps and verification.

If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

Conducting post-incident reviews

Writing postmortem documents

Facilitating blameless postmortem meetings

Identifying root causes and contributing factors

Creating actionable follow-up items

Building organizational learning culture

Core Concepts

1. Blameless Culture

Blame-Focused	Blameless
"Who caused this?"	"What conditions allowed this?"
"Someone made a mistake"	"The system allowed this mistake"
Punish individuals	Improve systems
Hide information	Share learnings
Fear of speaking up	Psychological safety

2. Postmortem Triggers

SEV1 or SEV2 incidents

Customer-facing outages > 15 minutes

Data loss or security incidents

Near-misses that could have been severe

Novel failure modes

Incidents requiring unusual intervention

Quick Start

Postmortem Timeline

Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents

Templates

Template 1: Standard Postmortem

# Postmortem: [Incident Title]
Date: 2024-01-15
Authors: @alice, @bob
Status: Draft | In Review | Final
Incident Severity: SEV2
Incident Duration: 47 minutes
Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
Impact:
12,000 customers unable to complete purchases

Estimated revenue loss: $45,000

847 support tickets created

No data loss or security implications
Timeline (All times UTC)
<div class="overflow-x-auto my-6"><table class="min-w-full divide-y divide-border border border-border"><thead><tr><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Time</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Event</th></tr></thead><tbody class="divide-y divide-border"><tr><td class="px-4 py-2 text-sm text-foreground">14:23</td><td class="px-4 py-2 text-sm text-foreground">Deployment v2.3.4 completed to production</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:31</td><td class="px-4 py-2 text-sm text-foreground">First alert: <code class="bg-muted px-1 py-0.5 rounded text-xs break-words">payment_error_rate &gt; 5%</code></td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:33</td><td class="px-4 py-2 text-sm text-foreground">On-call engineer @alice acknowledges alert</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:35</td><td class="px-4 py-2 text-sm text-foreground">Initial investigation begins, error rate at 23%</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:41</td><td class="px-4 py-2 text-sm text-foreground">Incident declared SEV2, @bob joins</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:45</td><td class="px-4 py-2 text-sm text-foreground">Database connection exhaustion identified</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:52</td><td class="px-4 py-2 text-sm text-foreground">Decision to rollback deployment</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">14:58</td><td class="px-4 py-2 text-sm text-foreground">Rollback to v2.3.3 initiated</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">15:10</td><td class="px-4 py-2 text-sm text-foreground">Rollback complete, error rate dropping</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">15:18</td><td class="px-4 py-2 text-sm text-foreground">Service fully recovered, incident resolved</td></tr></tbody></table></div>
Root Cause Analysis
What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
Why It Happened
Proximate Cause: Code change in PaymentRepository.java replaced pooled DataSource with direct DriverManager.getConnection() calls.
Contributing Factors:

   - Code review did not catch the connection handling change
   - No integration tests specifically for connection pool behavior
   - Staging environment has lower traffic, masking the issue
   - Database connection metrics alert threshold was too high (90%)
5 Whys Analysis:

   - Why did the service fail? → Database connections exhausted
   - Why were connections exhausted? → Each request opened new connection
   - Why did each request open new connection? → Code bypassed connection pool
   - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
   - Why was developer unfamiliar? → No documentation on connection management patterns
System Diagram

[Client] → [Load Balancer] → [Payment Service] → [Database]
↓
Connection Pool (broken)
↓
Direct connections (cause)

## Detection
What Worked

Error rate alert fired within 8 minutes of deployment

Grafana dashboard clearly showed connection spike

On-call response was swift (2 minute acknowledgment)
What Didn't Work

Database connection metric alert threshold too high

No deployment-correlated alerting

Canary deployment would have caught this earlier
Detection Gap

The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
Response
What Worked

On-call engineer quickly identified database as the issue

Rollback decision was made decisively

Clear communication in incident channel
What Could Be Improved

Took 10 minutes to correlate issue with recent deployment

Had to manually check deployment history

Rollback took 12 minutes (could be faster)
Impact
Customer Impact

12,000 unique customers affected

Average impact duration: 35 minutes

847 support tickets (23% of affected users)

Customer satisfaction score dropped 12 points
Business Impact

Estimated revenue loss: $45,000

Support cost: ~$2,500 (agent time)

Engineering time: ~8 person-hours
Technical Impact

Database primary experienced elevated load

Some replica lag during incident

No permanent damage to systems
Lessons Learned
What Went Well

Alerting detected the issue before customer reports

Team collaborated effectively under pressure

Rollback procedure worked smoothly

Communication was clear and timely
What Went Wrong

Code review missed critical change

Test coverage gap for connection pooling

Staging environment doesn't reflect production traffic

Alert thresholds were not tuned properly
Where We Got Lucky

Incident occurred during business hours with full team available

Database handled the load without failing completely

No other incidents occurred simultaneously
Action Items
<div class="overflow-x-auto my-6"><table class="min-w-full divide-y divide-border border border-border"><thead><tr><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Priority</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Action</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Owner</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Due Date</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Ticket</th></tr></thead><tbody class="divide-y divide-border"><tr><td class="px-4 py-2 text-sm text-foreground">P0</td><td class="px-4 py-2 text-sm text-foreground">Add integration test for connection pool behavior</td><td class="px-4 py-2 text-sm text-foreground">@alice</td><td class="px-4 py-2 text-sm text-foreground">2024-01-22</td><td class="px-4 py-2 text-sm text-foreground">ENG-1234</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">P0</td><td class="px-4 py-2 text-sm text-foreground">Lower database connection alert threshold to 70%</td><td class="px-4 py-2 text-sm text-foreground">@bob</td><td class="px-4 py-2 text-sm text-foreground">2024-01-17</td><td class="px-4 py-2 text-sm text-foreground">OPS-567</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">P1</td><td class="px-4 py-2 text-sm text-foreground">Document connection management patterns</td><td class="px-4 py-2 text-sm text-foreground">@alice</td><td class="px-4 py-2 text-sm text-foreground">2024-01-29</td><td class="px-4 py-2 text-sm text-foreground">DOC-89</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">P1</td><td class="px-4 py-2 text-sm text-foreground">Implement deployment-correlated alerting</td><td class="px-4 py-2 text-sm text-foreground">@bob</td><td class="px-4 py-2 text-sm text-foreground">2024-02-05</td><td class="px-4 py-2 text-sm text-foreground">OPS-568</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">P2</td><td class="px-4 py-2 text-sm text-foreground">Evaluate canary deployment strategy</td><td class="px-4 py-2 text-sm text-foreground">@charlie</td><td class="px-4 py-2 text-sm text-foreground">2024-02-15</td><td class="px-4 py-2 text-sm text-foreground">ENG-1235</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">P2</td><td class="px-4 py-2 text-sm text-foreground">Load test staging with production-like traffic</td><td class="px-4 py-2 text-sm text-foreground">@dave</td><td class="px-4 py-2 text-sm text-foreground">2024-02-28</td><td class="px-4 py-2 text-sm text-foreground">QA-123</td></tr></tbody></table></div>
Appendix
Supporting Data
Error Rate Graph

[Link to Grafana dashboard snapshot]
Database Connection Graph

[Link to metrics]
Related Incidents

2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
References

Connection Pool Best Practices

Deployment Runbook

Template 2: 5 Whys Analysis

# 5 Whys Analysis: [Incident]
Problem Statement

Payment service experienced 47-minute outage due to database connection exhaustion.
Analysis
Why #1: Why did the service fail?

Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
Why #2: Why were database connections exhausted?

Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: Code diff shows direct DriverManager.getConnection() instead of pooled DataSource.
Why #3: Why did the code bypass the connection pool?

Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.
Why #4: Why wasn't this caught in code review?

Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.
Why #5: Why isn't there a safety net for this type of change?

Answer: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
Evidence: Test suite has no tests for connection handling; wiki has no article on database connections.
Root Causes Identified
Primary: Missing automated tests for infrastructure behavior

Secondary: Insufficient documentation of architectural patterns

Tertiary: Code review checklist doesn't include infrastructure considerations
Systemic Improvements
<div class="overflow-x-auto my-6"><table class="min-w-full divide-y divide-border border border-border"><thead><tr><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Root Cause</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Improvement</th><th class="px-4 py-2 text-left text-sm font-semibold text-foreground bg-muted/50">Type</th></tr></thead><tbody class="divide-y divide-border"><tr><td class="px-4 py-2 text-sm text-foreground">Missing tests</td><td class="px-4 py-2 text-sm text-foreground">Add infrastructure behavior tests</td><td class="px-4 py-2 text-sm text-foreground">Prevention</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">Missing docs</td><td class="px-4 py-2 text-sm text-foreground">Document connection patterns</td><td class="px-4 py-2 text-sm text-foreground">Prevention</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">Review gaps</td><td class="px-4 py-2 text-sm text-foreground">Update review checklist</td><td class="px-4 py-2 text-sm text-foreground">Detection</td></tr><tr><td class="px-4 py-2 text-sm text-foreground">No canary</td><td class="px-4 py-2 text-sm text-foreground">Implement canary deployments</td><td class="px-4 py-2 text-sm text-foreground">Mitigation</td></tr></tbody></table></div>

Template 3: Quick Postmortem (Minor Incidents)

# Quick Postmortem: [Brief Title]
Date: 2024-01-15 | Duration: 12 min | Severity: SEV3
What Happened

API latency spiked to 5s due to cache miss storm after cache flush.
Timeline

10:00 - Cache flush initiated for config update

10:02 - Latency alerts fire

10:05 - Identified as cache miss storm

10:08 - Enabled cache warming

10:12 - Latency normalized
Root Cause

Full cache flush for minor config update caused thundering herd.
Fix

Immediate: Enabled cache warming

Long-term: Implement partial cache invalidation (ENG-999)
Lessons

Don't full-flush cache in production; use targeted invalidation.

Facilitation Guide

Running a Postmortem Meeting

## Meeting Structure (60 minutes)
1. Opening (5 min)

Remind everyone of blameless culture

"We're here to learn, not to blame"

Review meeting norms
2. Timeline Review (15 min)

Walk through events chronologically

Ask clarifying questions

Identify gaps in timeline
3. Analysis Discussion (20 min)

What failed?

Why did it fail?

What conditions allowed this?

What would have prevented it?
4. Action Items (15 min)

Brainstorm improvements

Prioritize by impact and effort

Assign owners and due dates
5. Closing (5 min)

Summarize key learnings

Confirm action item owners

Schedule follow-up if needed
Facilitation Tips

Keep discussion on track

Redirect blame to systems

Encourage quiet participants

Document dissenting views

Time-box tangents

Anti-Patterns to Avoid

Anti-Pattern	Problem	Better Approach
Blame game	Shuts down learning	Focus on systems
Shallow analysis	Doesn't prevent recurrence	Ask "why" 5 times
No action items	Waste of time	Always have concrete next steps
Unrealistic actions	Never completed	Scope to achievable tasks
No follow-up	Actions forgotten	Track in ticketing system

Best Practices

Do's

Start immediately - Memory fades fast

Be specific - Exact times, exact errors

Include graphs - Visual evidence

Assign owners - No orphan action items

Share widely - Organizational learning

Don'ts

Don't name and shame - Ever

Don't skip small incidents - They reveal patterns

Don't make it a blame doc - That kills learning

Don't create busywork - Actions should be meaningful

Don't skip follow-up - Verify actions completed

Resources

Google SRE - Postmortem Culture

Etsy's Blameless Postmortems

PagerDuty Postmortem Guide