service-mesh-observability
Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
Category: Development Tools
Service Mesh Observability Configuration Guide
Skill Overview
The Service Mesh observability skill provides a complete monitoring solution for Istio and Linkerd service meshes, including best practices for distributed tracing, metrics collection, service dependency visualization, and alert configuration.
Applicable Scenarios
1. Service Mesh Monitoring Deployment
When you need to establish a complete observability system for a Service Mesh in a Kubernetes cluster, use this skill. It covers configuration approaches for core components such as Prometheus metrics collection, Jaeger distributed tracing, and Kiali service topology visualization, helping you quickly set up a production-grade monitoring platform.
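As a rough deployment sketch, Istio ships sample manifests for exactly these components; the paths below follow the layout of the Istio release archive and may differ in your version:

```shell
# Deploy the sample observability addons bundled with the Istio release archive
kubectl apply -f samples/addons/prometheus.yaml
kubectl apply -f samples/addons/jaeger.yaml
kubectl apply -f samples/addons/kiali.yaml

# Open the Kiali dashboard locally
istioctl dashboard kiali
```

The sample addons are suitable for evaluation; a production-grade platform would replace them with a managed or HA deployment of each component.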
2. Latency and Fault Troubleshooting
When inter-service calls hit performance bottlenecks or behave abnormally, this skill provides a full troubleshooting workflow: use distributed tracing to pinpoint the specific time-consuming stages, query error-rate and latency metrics with PromQL, and combine Grafana dashboards to quickly identify root causes.
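As a starting point for the latency side of that workflow, a PromQL query like the following (assuming Istio's standard istio_request_duration_milliseconds histogram) surfaces per-workload P99 latency:

```promql
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_workload))
```

Workloads whose P99 jumps in this view are the ones to drill into with Jaeger traces.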
3. Defining Service Communication SLOs
When you need to define service quality objectives for your service mesh, this skill provides configuration of monitoring thresholds for the “golden signals” (latency, traffic, errors, and saturation), Prometheus alert rule writing, and an OpenTelemetry integration approach. It helps you build a reliable service-level monitoring system.
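The alert-rule part of that system can be sketched as a Prometheus rule; the expression mirrors the error-rate query used elsewhere in this guide, and the 1% threshold and rule names are only examples:

```yaml
# Illustrative Prometheus alert rule for an error-rate SLO (threshold and names are examples)
groups:
  - name: mesh-slo
    rules:
      - alert: MeshHighErrorRate
        expr: |
          sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
            / sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Mesh-wide 5xx error rate above 1% for 5 minutes"
```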
Core Features
Distributed Tracing Configuration
Provides complete integration configurations for Istio and Linkerd with Jaeger/Zipkin, including trace sampling rate settings (100% for development, 1–10% for production), span context propagation, and methods for analyzing trace data—helping you achieve end-to-end tracing of service call chains.
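In Istio, the sampling rate can be set declaratively with the Telemetry API. A minimal sketch for a development namespace, assuming a tracing provider such as Jaeger is already defined under meshConfig.extensionProviders:

```yaml
# Illustrative: 100% trace sampling for a dev namespace (use 1-10% in production)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: tracing-dev
  namespace: dev
spec:
  tracing:
    - randomSamplingPercentage: 100.0
```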
Metrics Collection and Querying
Includes built-in PromQL query templates for core Service Mesh metrics, covering request rate, error rate, P99 latency, and TCP connection counts. Paired with Prometheus Operator ServiceMonitor configuration, it enables automated metrics collection and visualization dashboards.
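A minimal ServiceMonitor sketch for the Istio control plane (istiod exposes metrics on its http-monitoring port, 15014; namespace and labels are assumptions to adapt):

```yaml
# Illustrative ServiceMonitor for istiod metrics; sidecar (data plane) metrics are
# typically scraped instead with a PodMonitor against port 15090, path /stats/prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istiod-metrics
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: pilot
  endpoints:
    - port: http-monitoring
      interval: 30s
```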
Service Dependency Visualization
Provides Kiali service topology diagram configuration and Linkerd Viz real-time traffic analysis commands to help you intuitively view service dependency relationships, real-time traffic status, and per-route metric analysis—effectively supporting operations and decision-making for microservices architectures.
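The Linkerd side of this can be sketched with the viz CLI (the deployment and namespace names are placeholders):

```shell
# Success rate, RPS, and latency percentiles for a deployment
linkerd viz stat deploy/my-service -n my-namespace

# Per-route metrics (requires a ServiceProfile for the service)
linkerd viz routes deploy/my-service -n my-namespace

# Stream live request/response metadata from the proxies
linkerd viz tap deploy/my-service -n my-namespace
```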
Common Questions
What are the three pillars of Service Mesh observability?
The three pillars of observability are metrics, traces, and logs. Metrics focus on numeric data such as request rate, error rate, and latency. Traces record the complete call chain across services using span context, helping locate performance bottlenecks and dependency relationships. Logs provide access records, error details, and audit information. When used together in a Service Mesh, these enable comprehensive service monitoring and fault diagnostics.
How should the trace sampling rate be set in production?
In production, it is recommended to set a 1–10% trace sampling rate. 100% sampling generates massive data, significantly increasing storage costs and query overhead, whereas appropriate sampling can still capture most performance issues. Development and test environments can use 100% sampling to obtain complete trace data. In Istio, use the tracing.sampling parameter to control sampling; in Linkerd, configure it with environment variables. For critical business traffic, consider intelligent sampling strategies based on service or headers.
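For the per-service sampling idea, Istio's Telemetry API allows a selector-scoped override on top of the mesh default; a sketch, where the workload label and rate are assumptions:

```yaml
# Illustrative: sample a critical workload at a higher rate than the mesh default
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: checkout-tracing
  namespace: prod
spec:
  selector:
    matchLabels:
      app: checkout          # hypothetical workload label
  tracing:
    - randomSamplingPercentage: 25.0
```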
How do I use PromQL to query the error rate of a service mesh?
Use the following PromQL query to calculate the 5xx error rate:

sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

This query returns the percentage of 5xx responses over a 5-minute window across the mesh; add a by (destination_service) clause to both sums to break the rate down per service. It is recommended to pair it with Grafana panels and threshold-based alerts (e.g., trigger when the rate exceeds 1%). For Linkerd, the same pattern applies to its proxy metrics, such as response_total with the classification="failure" label.
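A comparable query for Linkerd, assuming its standard proxy metric response_total with the usual classification and direction labels, might look like:

```promql
sum(rate(response_total{classification="failure", direction="inbound"}[5m]))
  / sum(rate(response_total{direction="inbound"}[5m])) * 100
```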
Which Service Mesh platforms does this skill support?
It mainly supports the two most widely used Service Mesh platforms: Istio and Linkerd. For Istio, it covers the Telemetry v2 API, Prometheus integration, Jaeger tracing, and Kiali visualization; for Linkerd, the viz extension, CLI monitoring commands, and tap traffic inspection. It also provides a generic OpenTelemetry Collector integration approach that can support other mesh implementations compatible with the OTLP protocol.
How can I reduce storage costs for Service Mesh observability?
It is recommended to adopt the following strategies: set sampling rates appropriately (1–10% in production), control metric cardinality (avoid high-cardinality labels), use hot/cold storage tiering (high-performance storage for recent data, low-cost object storage for historical data), and set data retention policies (e.g., retain trace data for 7–30 days). Also monitor the resource consumption of the observability components themselves to ensure the observability system does not become a burden on the business.
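For the retention-policy part, Prometheus exposes two flags that cap local storage; the values below are examples, not recommendations:

```shell
# Illustrative Prometheus retention settings (pair with remote/object storage for history)
--storage.tsdb.retention.time=15d    # drop metric data older than 15 days
--storage.tsdb.retention.size=200GB  # hard cap on local TSDB size
```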