devops-troubleshooter
Expert DevOps troubleshooter specializing in rapid incident response, advanced debugging, and modern observability. Masters log analysis, distributed tracing, Kubernetes debugging, performance optimization, and root cause analysis. Handles production outages, system reliability, and preventive monitoring. Use PROACTIVELY for debugging, incident response, or system troubleshooting.
DevOps Troubleshooter - Intelligent Production Incident Troubleshooting Assistant
Skill Overview
DevOps Troubleshooter is a professional production incident response and system debugging assistant, proficient in log analysis, distributed tracing, Kubernetes debugging, performance optimization, and root cause analysis. It helps you quickly locate and resolve various technical issues in production environments.
Applicable Scenarios
1. Production Incident Emergency Response
When a production service suddenly goes down, APIs time out, or user complaints surge, this skill can guide you to rapidly complete incident assessment, data collection, hypothesis validation, and emergency remediation to minimize business impact. It supports quickly retrieving key data from platforms like CloudWatch, Azure Monitor, and GCP Cloud Logging, and provides a systematic incident troubleshooting workflow.
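The "assess, then collect, then validate" flow can be made concrete with a small triage sketch. The `ImpactSnapshot` fields and the severity thresholds below are illustrative assumptions for this example, not values defined by the skill:

```python
# Hypothetical severity triage: thresholds are illustrative, not a standard.
from dataclasses import dataclass

@dataclass
class ImpactSnapshot:
    error_rate: float          # fraction of requests failing (0.0-1.0)
    p99_latency_ms: float      # 99th-percentile request latency
    users_affected_pct: float  # share of users seeing the problem

def triage(s: ImpactSnapshot) -> str:
    """Map an impact snapshot to a severity label (illustrative thresholds)."""
    if s.error_rate >= 0.5 or s.users_affected_pct >= 50:
        return "SEV1"  # major outage: page on-call immediately
    if s.error_rate >= 0.05 or s.p99_latency_ms >= 2000:
        return "SEV2"  # partial degradation
    return "SEV3"      # minor issue: handle in business hours

print(triage(ImpactSnapshot(error_rate=0.6, p99_latency_ms=900,
                            users_affected_pct=70)))  # → SEV1
```

In practice the snapshot would be filled from dashboards (CloudWatch, Azure Monitor, GCP Cloud Logging) rather than typed by hand; the point is to decide severity from measured impact before forming hypotheses.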
2. Kubernetes and Container Environment Debugging
For common issues in Kubernetes clusters—such as frequent Pod restarts (OOMKilled), CPU throttling, network connectivity anomalies, and storage mount failures—the skill provides targeted kubectl debugging commands and diagnostic approaches. It supports problem localization for container runtimes like Docker, containerd, and CRI-O, as well as traffic and security troubleshooting for service meshes like Istio and Linkerd.
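A restart cause like OOMKilled can often be confirmed directly from the Pod status. As a minimal sketch (the sample document below is fabricated; the field layout follows the Kubernetes Pod API), this parses the JSON that `kubectl get pod <name> -o json` emits and lists containers whose last termination reason was OOMKilled:

```python
import json

def oomkilled_containers(pod: dict) -> list[str]:
    """Return names of containers whose last termination was OOMKilled.

    `pod` is the parsed output of `kubectl get pod <name> -o json`.
    """
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        term = cs.get("lastState", {}).get("terminated", {})
        if term.get("reason") == "OOMKilled":
            hits.append(cs["name"])
    return hits

# Fabricated sample: exit code 137 = killed by SIGKILL (the OOM killer).
sample = json.loads("""{
  "status": {"containerStatuses": [
    {"name": "app", "restartCount": 7,
     "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
    {"name": "sidecar", "restartCount": 0, "lastState": {}}
  ]}
}""")
print(oomkilled_containers(sample))  # → ['app']
```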
3. Microservice Performance Bottleneck Analysis
Using distributed tracing data (Jaeger, Zipkin, AWS X-Ray, OpenTelemetry), the skill locates performance bottlenecks in microservice architectures by analyzing inter-service call chains, dependencies, and latency distributions. Combined with APM tools (DataDog, New Relic, Dynatrace) for deep performance profiling, it helps identify root causes such as memory leaks, CPU hotspots, and garbage collection issues.
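The core of call-chain analysis is separating a span's total duration from its exclusive ("self") time. A minimal sketch of that computation follows; the span field names (`span_id`, `parent_id`, `service`, `duration_ms`) and the trace itself are assumptions for illustration, as real exporters like Jaeger or Zipkin use their own schemas:

```python
from collections import defaultdict

# Fabricated single trace: gateway -> orders -> postgres.
spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway",  "duration_ms": 420},
    {"span_id": "b", "parent_id": "a",  "service": "orders",   "duration_ms": 380},
    {"span_id": "c", "parent_id": "b",  "service": "postgres", "duration_ms": 300},
]

def self_time_by_service(spans: list[dict]) -> dict[str, float]:
    """Exclusive time per service: span duration minus its direct children's time."""
    child_total = defaultdict(float)
    for s in spans:
        if s["parent_id"] is not None:
            child_total[s["parent_id"]] += s["duration_ms"]
    totals = defaultdict(float)
    for s in spans:
        totals[s["service"]] += s["duration_ms"] - child_total[s["span_id"]]
    return dict(totals)

# The service with the largest self time is the likeliest bottleneck.
print(max(self_time_by_service(spans).items(), key=lambda kv: kv[1]))
```

Here the gateway's 420 ms is mostly waiting on downstream calls; only 300 ms of exclusive time in postgres marks it as the place to look first.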
Core Features
Full-stack Observability Analysis
Integrates the three pillars of observability—logs, metrics, and traces—extracting valuable information from log platforms like ELK Stack, Loki/Grafana, Fluentd; monitoring systems like Prometheus, Grafana, InfluxDB; and various APM and tracing tools to perform multi-dimensional correlation analysis and quickly pinpoint root causes.
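Multi-dimensional correlation usually reduces to aligning signals on a shared time axis. The sketch below is a toy version of that idea — the data points and thresholds are fabricated for illustration; a real pipeline would query Loki/Prometheus or an APM API instead:

```python
from collections import Counter

# Fabricated signals, bucketed by epoch minute.
error_log_ts = [600, 601, 601, 602, 650]                     # minutes of ERROR lines
cpu_samples = {600: 0.55, 601: 0.97, 602: 0.95, 650: 0.40}   # minute -> CPU util

def correlated_minutes(log_ts, cpu, err_threshold=2, cpu_threshold=0.9):
    """Minutes where an error spike coincides with CPU saturation."""
    errs = Counter(log_ts)
    return sorted(m for m, n in errs.items()
                  if n >= err_threshold and cpu.get(m, 0) >= cpu_threshold)

print(correlated_minutes(error_log_ts, cpu_samples))  # → [601]
```

Joining logs and metrics this way narrows the incident window before diving into traces for the affected minutes.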
Systematic Incident Troubleshooting Methodology
Following SRE best practices, it adopts a systematic "collect facts before forming hypotheses" approach, guiding users to validate hypotheses with minimal impact and emphasizing thorough documentation and blameless post-incident analysis. It not only resolves the current issue but also recommends adding monitoring and alerts to prevent recurrence.
Multi-cloud and Hybrid Cloud Environment Support
Covers debugging scenarios across the three major cloud providers—AWS, Azure, GCP—and hybrid cloud environments, including cloud-service-specific issues, cross-cloud communication failures, identity federation problems, and serverless architecture debugging. It also supports infrastructure-level troubleshooting such as Terraform state issues, Ansible playbook failures, and Vault integration.
Frequently Asked Questions
When a production outage occurs, what process should be followed for troubleshooting?
The recommended troubleshooting process has nine steps:
1. Assess incident severity based on impact scope.
2. Collect comprehensive data from logs, metrics, traces, and system state.
3. Form systematic hypotheses and validate them in ways that minimize impact.
4. Implement emergency recovery measures while planning permanent fixes.
5. Document all findings in detail for post-incident review.
6. Add monitoring and alerts to prevent recurrence.
7. Plan long-term improvements to increase system resilience.
8. Share knowledge through runbooks and documentation.
9. Perform a blameless post-incident analysis to identify systemic improvement opportunities.
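The documentation and blameless-review steps depend on keeping a clean timeline of observations, hypotheses, and actions while the incident is live. A minimal sketch of such an append-only incident log follows; the `IncidentLog` class and the `INC-1234` entries are illustrative, not part of the skill:

```python
from datetime import datetime, timezone

class IncidentLog:
    """Append-only timeline of findings and actions for the post-incident review."""

    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.entries: list[tuple[str, str, str]] = []

    def record(self, kind: str, note: str) -> None:
        # UTC timestamps avoid timezone confusion across a distributed on-call team.
        ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append((ts, kind, note))

    def to_markdown(self) -> str:
        lines = [f"# Incident {self.incident_id} timeline"]
        lines += [f"- {ts} **{kind}**: {note}" for ts, kind, note in self.entries]
        return "\n".join(lines)

# Fabricated example entries.
log = IncidentLog("INC-1234")
log.record("observation", "checkout API p99 latency jumped to 8s")
log.record("hypothesis", "connection pool exhaustion after deploy 2.4.1")
log.record("action", "rolled back to 2.4.0; latency recovered")
print(log.to_markdown())
```

Recording hypotheses alongside actions makes the later blameless analysis factual rather than reconstructed from memory.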
What causes Kubernetes Pods to frequently restart with OOMKilled?
OOMKilled is usually caused by one of the following:
1. Container memory limits configured too low for the application's actual needs.
2. Application memory leaks causing usage to grow until the limit is exceeded.
3. Improper JVM heap configuration, so GC cannot effectively reclaim memory.
4. Sudden spikes in concurrent requests causing memory peaks.
5. Cache sizes exceeding expectations.

To troubleshoot, use kubectl describe pod to view events, analyze application logs with kubectl logs, and, when necessary, use kubectl exec to enter the container and run memory analysis tools (e.g., top, ps, jmap). Also review the application's memory settings and caching strategy.
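To judge whether a limit is simply too low (cause 1), compare observed usage against the configured limit. A hedged sketch, handling only the common binary suffixes of Kubernetes memory quantities (real quantities also allow decimal suffixes like M and G, omitted here):

```python
# Binary-suffix multipliers for Kubernetes memory quantities (subset).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(qty: str) -> int:
    """Convert a quantity like '512Mi' to bytes; plain digits are raw bytes."""
    for suffix, mult in UNITS.items():
        if qty.endswith(suffix):
            return int(qty[:-2]) * mult
    return int(qty)

def memory_headroom(usage: str, limit: str) -> float:
    """Fraction of the memory limit still free (negative means over the limit)."""
    u, l = to_bytes(usage), to_bytes(limit)
    return (l - u) / l

# Usage from e.g. `kubectl top pod`, limit from the pod spec (values fabricated).
print(round(memory_headroom("460Mi", "512Mi"), 3))  # → 0.102
```

Headroom consistently below ~10-20% at normal load suggests raising the limit; headroom that shrinks steadily over time points instead at a leak.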
How to use distributed tracing to locate microservice performance bottlenecks?
Distributed tracing locates performance issues by following the full request path across microservices:
1. Ensure services have tracing enabled via OpenTelemetry, Jaeger, Zipkin, or similar.
2. Search the tracing platform for high-latency or high-error-rate traces.
3. Analyze the call chain (spans) to find the service or operation that consumes the most time.
4. Check for serial calls, redundant calls, or unnecessary data transfer.
5. Combine traces with application logs and metrics to analyze the root cause of slow operations in depth.
6. After optimizing, validate the improvement with fresh tracing data.

Common tips include setting an appropriate sampling rate, adding key business tags, and paying attention to cross-service call timeouts.
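Step (2) — picking out the traces worth inspecting — amounts to a quantile cut over trace durations. A small illustrative helper (the trace ids and durations are fabricated; tracing platforms expose this as a built-in query):

```python
def slow_traces(durations_ms: dict[str, float], quantile: float = 0.95) -> list[str]:
    """Return trace ids whose duration is at or above the given quantile."""
    ordered = sorted(durations_ms.values())
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    cutoff = ordered[idx]
    return sorted(t for t, d in durations_ms.items() if d >= cutoff)

# Fabricated trace durations in milliseconds; t3 is the outlier.
traces = {"t1": 120, "t2": 95, "t3": 2400, "t4": 110, "t5": 130}
print(slow_traces(traces))  # → ['t3']
```

The same cut, applied again after an optimization (step 6), gives a before/after comparison on identical criteria.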