incident-response-incident-response

Use when working with incident response.

Category

Other Tools

Install

Download and extract to your skills directory

Copy the command below and send it to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-incident-response-incident-response&locale=en&source=copy

Incident Response - Multi-Agent Incident Response and Fault Handling Skills

Skill Overview


This is an incident response skill based on modern SRE practices. It uses multi-agent collaboration to handle production incidents quickly, automating the workflow end to end: detection, analysis, repair, and post-incident review.

Applicable Scenarios

1. Production Environment Fault Response


When production systems experience service downtime, performance degradation, or security incidents, this skill can quickly launch a multi-agent collaborative workflow. It automatically classifies the incident, performs observability analysis, and identifies the root cause, helping the team restore service within the required SLA timeframe.

2. SRE Team Incident Management


Suitable for teams following SRE best practices. By defining severity levels (P0–P3) and standardized response procedures, it ensures every incident is handled promptly and in an orderly manner. Through blameless post-incident reviews, it turns experience into team knowledge.

3. DevOps Troubleshooting and Post-Incident Review


For scenarios where the goal is to quickly identify the root cause and generate a post-incident review document, the skill coordinates multiple specialized agents—covering debugging, observability, security, performance, and more—providing comprehensive fault analysis and improvement recommendations.

Core Capabilities

Incident command system powered by multi-agent collaboration


Implements a complete Incident Command System (ICS) covering 13 key steps, including incident detection and classification, observability analysis, deep system troubleshooting, security assessment, performance analysis, remediation implementation, deployment verification, stakeholder communication, and post-incident review. Each step is handled by a dedicated agent.
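The one-agent-per-step idea can be sketched as a simple sequential pipeline. This is only an illustration, not the skill's actual implementation: the names `IncidentContext`, `placeholder_agent`, and `run_pipeline` are hypothetical, and only the nine stages named in the text are listed, since the source does not enumerate all 13.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stages named in the text above. The source mentions 13 steps but
# names only these nine, so the rest are intentionally left out.
STAGES = [
    "incident detection and classification",
    "observability analysis",
    "deep system troubleshooting",
    "security assessment",
    "performance analysis",
    "remediation implementation",
    "deployment verification",
    "stakeholder communication",
    "post-incident review",
]


@dataclass
class IncidentContext:
    """Shared state passed from one agent to the next (hypothetical)."""
    summary: str
    findings: dict[str, str] = field(default_factory=dict)


# An "agent" here is just a callable that reads the context and returns a finding.
Agent = Callable[[IncidentContext], str]


def placeholder_agent(stage: str) -> Agent:
    """Build a stand-in agent for stages that have no real handler."""
    def run(ctx: IncidentContext) -> str:
        return f"[{stage}] no findings recorded for: {ctx.summary}"
    return run


def run_pipeline(ctx: IncidentContext, agents: dict[str, Agent]) -> IncidentContext:
    """Run every stage in order, storing each agent's finding under its stage name."""
    for stage in STAGES:
        agent = agents.get(stage, placeholder_agent(stage))
        ctx.findings[stage] = agent(ctx)
    return ctx
```

A real coordinator would likely run some stages in parallel and let later agents read earlier findings; the sequential loop is kept here only for clarity.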

Automated incident classification and impact assessment


Based on monitoring data and SLO violations, it automatically sorts incidents into four severity levels, from P0 (complete outage or security vulnerability) down to P3 (interface-only issues). It also evaluates the scope of impact on users and the business, providing a basis for subsequent response decisions.

Blameless post-incident review and continuous improvement


After incident resolution, it automatically generates a blameless post-incident review document that meets SRE standards. It records a complete timeline, root cause, and the strengths and weaknesses of the response process, and generates specific action items and monitoring improvement recommendations—ensuring that every incident becomes a learning opportunity for the team.
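The review contents listed above (timeline, root cause, strengths and weaknesses, action items, monitoring improvements) map naturally onto a small record type. The following is a minimal sketch under that assumption; the class name `PostIncidentReview`, its field names, and the plain-text layout are hypothetical and do not reflect the skill's actual output format.

```python
from dataclasses import dataclass, field


@dataclass
class PostIncidentReview:
    """Hypothetical container for the review sections named in the text."""
    title: str
    root_cause: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (time, event)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)
    monitoring_improvements: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the review as plain text, one section per heading."""
        lines = [f"Post-Incident Review: {self.title}", "", "Timeline"]
        lines += [f"  {when}  {what}" for when, what in self.timeline]
        lines += ["", "Root cause", f"  {self.root_cause}"]
        sections = [
            ("What went well", self.went_well),
            ("What went poorly", self.went_poorly),
            ("Action items", self.action_items),
            ("Monitoring improvements", self.monitoring_improvements),
        ]
        for heading, items in sections:
            lines += ["", heading]
            # Empty sections are shown explicitly so gaps are visible to reviewers.
            lines += [f"  - {item}" for item in items] or ["  (none recorded)"]
        return "\n".join(lines)
```

Note that the structure has no field for who made a mistake, only for what happened and what to change, which is one simple way to keep the template blameless.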

Common Questions

What kind of teams is this skill suitable for?


It is suitable for SRE, operations, or DevOps teams of a certain size—especially those that need to handle complex distributed system failures and want to standardize and automate incident handling workflows. If your system has a smaller user base and incidents are infrequent, you may not need such a comprehensive process.

How does the skill determine the severity level of an incident?


The skill makes a composite assessment by analyzing monitoring alerts, SLO violations, the scope of user impact, and business risk. P0 indicates a complete service interruption, a security vulnerability, or data loss—requiring an immediate company-wide response. P1 indicates severe performance degradation. P2 and P3 indicate relatively limited impact.
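The severity rules described above can be sketched as a small decision function. This is an illustration only: the P0 and P1 conditions follow the text, but the source does not say how P2 and P3 are separated, so the `affected_user_fraction` input and the 5% threshold are assumptions made up for this example. The names `Severity`, `IncidentSignals`, and `classify` are likewise hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = 0  # complete interruption, security vulnerability, or data loss
    P1 = 1  # severe performance degradation
    P2 = 2  # limited impact (boundary with P3 is an assumption, see below)
    P3 = 3  # limited impact


@dataclass
class IncidentSignals:
    """Hypothetical summary of monitoring alerts, SLO violations, and user impact."""
    full_outage: bool = False
    security_vulnerability: bool = False
    data_loss: bool = False
    severe_degradation: bool = False
    affected_user_fraction: float = 0.0  # 0.0 to 1.0, assumed input


def classify(signals: IncidentSignals, p2_user_threshold: float = 0.05) -> Severity:
    """Map signals to a severity level, checking the most severe conditions first."""
    if signals.full_outage or signals.security_vulnerability or signals.data_loss:
        return Severity.P0
    if signals.severe_degradation:
        return Severity.P1
    # ASSUMPTION: the source only says P2 and P3 have "relatively limited impact".
    # The user-fraction threshold below is invented for this sketch.
    if signals.affected_user_fraction >= p2_user_threshold:
        return Severity.P2
    return Severity.P3
```

Checking conditions from most to least severe means an incident that matches several rules always receives the highest applicable level.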

When is a blameless post-incident review needed?


Any P0 or P1 incident should have a blameless post-incident review. For P2 and P3 incidents, if they involve new failure modes, expose system weaknesses, or have important lessons worth sharing, it is also recommended to conduct a review. The core of a blameless review is to focus on system issues rather than individual accountability.
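The rule above is compact enough to express as a single predicate. The sketch below assumes severity is passed as a string label and that the three P2/P3 triggers are supplied as booleans; the function name `needs_review` and its parameter names are hypothetical.

```python
def needs_review(
    severity: str,
    new_failure_mode: bool = False,
    exposes_weakness: bool = False,
    notable_lessons: bool = False,
) -> bool:
    """Return True when a blameless post-incident review is called for."""
    if severity in ("P0", "P1"):
        # Every P0 or P1 incident gets a review, regardless of other factors.
        return True
    if severity in ("P2", "P3"):
        # Lower severities are reviewed only when there is something to learn.
        return new_failure_mode or exposes_weakness or notable_lessons
    raise ValueError(f"unknown severity level: {severity!r}")
```

For example, `needs_review("P3", new_failure_mode=True)` returns `True`, while `needs_review("P2")` returns `False`.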