incident-responder
Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management. Masters incident command, blameless post-mortems, error budget management, and system reliability patterns. Handles critical outages, communication strategies, and continuous improvement. Use IMMEDIATELY for production incidents or SRE practices.
Category: Other Tools
Download and extract to your skills directory
Copy command and send to OpenClaw for auto-install:
Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-incident-responder&locale=en&source=copy
Incident Responder - SRE Incident Response Expert
Skill Overview
Incident Responder is a professional SRE incident response assistant that helps you rapidly pinpoint problems during production incidents, coordinate response teams, execute containment measures, and conduct blameless post-incident reviews to continuously improve system reliability.
Use Cases
When your system experiences P0/P1-level failures, Incident Responder can immediately guide you to establish an incident command structure, assess the scope of impact, perform quick containment actions (e.g., traffic switching, feature degradation, and rolling back changes), and coordinate both internal and external communications.
For teams that are building SRE capabilities, this skill provides a complete incident response framework covering on-call scheduling, alert severity classification, runbook automation, and MTTR optimization, so that you can establish a mature incident management process.
When dealing with cascading failures in distributed systems, intermittent errors, or performance bottlenecks, Incident Responder guides you to use observability tools (OpenTelemetry, Prometheus, ELK) for correlation analysis, identify service dependencies, trace request flows, and locate the underlying root cause.
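The dependency-tracing idea in the previous paragraph can be illustrated with a small sketch. It assumes you already have a service dependency map (for example, derived from traces) and a set of services that are currently alerting. The service names, the DEPENDS_ON map, and both function names are hypothetical and not part of the skill itself.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "frontend": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
    "search": [],
    "db": [],
}


def root_cause_candidates(depends_on, unhealthy):
    """Return unhealthy services that have no unhealthy direct dependency.

    In a cascading failure, errors propagate upward from callee to caller,
    so an unhealthy service whose dependencies all look healthy is a
    reasonable first place to investigate.
    """
    unhealthy = set(unhealthy)
    return sorted(
        svc for svc in unhealthy
        if not unhealthy.intersection(depends_on.get(svc, ()))
    )


def blast_radius(depends_on, failed):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the map: service -> services that call it.
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


if __name__ == "__main__":
    print(root_cause_candidates(DEPENDS_ON, {"frontend", "checkout", "payments"}))
    print(blast_radius(DEPENDS_ON, "db"))
```

This is only a heuristic: it narrows the search to the deepest failing service but cannot distinguish a faulty service from one whose failing dependency is simply not monitored.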
Core Features
Provides a “First 5 Minutes Golden Response” framework: quickly assess user impact and business loss, appoint an Incident Commander and set up the command structure, establish a War Room, and execute rapid containment (circuit breaking, degradation, rollback). Includes a P0–P3 incident classification standard that defines the response SLAs and communication requirements for each level.
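As a rough illustration of what a P0–P3 classification with per-level response requirements might look like, here is a minimal sketch. The thresholds, acknowledgement times, and update cadences below are placeholder assumptions chosen for the example; the skill's actual classification standard is not reproduced here, and every team should set its own values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # most severe
    P1 = 1
    P2 = 2
    P3 = 3  # least severe


@dataclass(frozen=True)
class ResponsePolicy:
    ack_minutes: int            # assumed time allowed to acknowledge the alert
    update_every_minutes: int   # assumed cadence of status updates
    page_on_call: bool          # whether to page rather than open a ticket


# Placeholder values for illustration only.
POLICIES = {
    Severity.P0: ResponsePolicy(ack_minutes=5, update_every_minutes=15, page_on_call=True),
    Severity.P1: ResponsePolicy(ack_minutes=15, update_every_minutes=30, page_on_call=True),
    Severity.P2: ResponsePolicy(ack_minutes=60, update_every_minutes=120, page_on_call=False),
    Severity.P3: ResponsePolicy(ack_minutes=240, update_every_minutes=1440, page_on_call=False),
}


def classify(users_affected_pct: float, core_flow_down: bool) -> Severity:
    """Map impact to a severity level using hypothetical thresholds."""
    if core_flow_down or users_affected_pct >= 50:
        return Severity.P0
    if users_affected_pct >= 10:
        return Severity.P1
    if users_affected_pct >= 1:
        return Severity.P2
    return Severity.P3


if __name__ == "__main__":
    level = classify(users_affected_pct=12.0, core_flow_down=False)
    print(level.name, POLICIES[level])
```

Encoding the policy as data rather than prose makes it easy to review and lets alert routing consume the same table the responders read.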
Incorporates best practices from modern SRE toolchains, including distributed tracing (Jaeger/Zipkin), metrics correlation (Prometheus/Grafana), log aggregation (ELK/Loki), and APM analysis (Datadog), helping you quickly identify bottlenecks, recognize abnormal patterns, and analyze cascading failure paths.
Guides you through high-quality post-mortem analysis, including detailed timeline reconstruction, the “Five Whys” root-cause method, identification of contributing factors (people/process/technical debt), and tracking of actionable improvement items. It also offers long-term improvement recommendations such as error budget management, MTTR/MTTD metric monitoring, and building a team learning culture.
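The error budget and MTTR/MTTD metrics mentioned above reduce to simple arithmetic, sketched below. The sketch assumes an availability SLO measured over a rolling window, takes MTTD as the mean time from incident start to detection, and takes MTTR as the mean time from start to resolution; definitions of MTTR vary between teams, so treat these as one possible convention. The function names and sample incidents are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def mean_minutes(durations) -> float:
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttd_and_mttr(incidents):
    """incidents: iterable of (started, detected, resolved) datetimes."""
    incidents = list(incidents)
    mttd = mean_minutes(detected - started for started, detected, _ in incidents)
    mttr = mean_minutes(resolved - started for started, _, resolved in incidents)
    return mttd, mttr


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    sample = [
        (t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
        (t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=26)),
    ]
    print(error_budget_minutes(0.999))   # about 43.2 minutes per 30 days
    print(budget_remaining(0.999, 21.6)) # about half the budget left
    print(mttd_and_mttr(sample))
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime, which is why a single long incident can consume most of a month's budget.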
Common Questions
What’s the difference between Incident Responder and a typical operations assistant?
A typical operations assistant usually focuses on technical fixes for isolated issues, while Incident Responder focuses on structured incident management. Besides helping you troubleshoot technical problems, it coordinates the entire incident lifecycle: setting up a command structure, managing internal and external communications, making containment decisions, recording decision timelines, and organizing post-incident reviews. It follows core SRE principles, emphasizing “restore service first, then analyze the root cause” and a “blameless culture.”
During a production incident, what should you do in the first 5 minutes?
The first 5 minutes are all about stabilizing the situation:
1) Quickly assess impact scope (number of users, regions, business loss);
2) Appoint an Incident Commander as the single decision-maker, and assign a communication lead and a technical lead;
3) Execute the fastest containment measures (roll back the most recent changes, enable circuit breakers, limit traffic, scale up resources);
4) Publish an initial status update.
Avoid deep root-cause analysis while the incident is still active. Prioritize restoring service, and leave detailed diagnosis for after the system stabilizes.
What incident management tools does this skill support?
Incident Responder provides integration best practices for mainstream incident management platforms, including PagerDuty (alert escalation and response coordination), Opsgenie (incident management and on-call scheduling), ServiceNow (ITSM integration and change correlation), and Slack/Teams (War Room communication and ChatOps automation). It also covers recommended usage of observability toolchains such as OpenTelemetry, Prometheus, Grafana, ELK, and Datadog.