service-mesh-observability

Implement comprehensive observability for service meshes, including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

Do not use this skill when

  • The task is unrelated to service mesh observability

  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.

  • Apply relevant best practices and validate outcomes.

  • Provide actionable steps and verification.

  • If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

  • Setting up distributed tracing across services

  • Implementing service mesh metrics and dashboards

  • Debugging latency and error issues

  • Defining SLOs for service communication

  • Visualizing service dependencies

  • Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

    ┌─────────────────────────────────────────────────────┐
    │                    Observability                    │
    ├─────────────────┬─────────────────┬─────────────────┤
    │     Metrics     │     Traces      │      Logs       │
    │                 │                 │                 │
    │ • Request rate  │ • Span context  │ • Access logs   │
    │ • Error rate    │ • Latency       │ • Error details │
    │ • Latency P50   │ • Dependencies  │ • Debug info    │
    │ • Saturation    │ • Bottlenecks   │ • Audit trail   │
    └─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

    Signal      Description                 Alert Threshold
    Latency     Request duration P50, P99   P99 > 500ms
    Traffic     Requests per second         Anomaly detection
    Errors      5xx error rate              > 1%
    Saturation  Resource utilization        > 80%
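
The thresholds in the table can be checked mechanically. The following is a minimal sketch, assuming the readings have already been fetched from the metrics backend; the `GoldenSignals` container and `breached_signals` helper are hypothetical names introduced for illustration. Traffic is left out because the table calls for anomaly detection rather than a fixed threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    """Point-in-time readings for one service (hypothetical container)."""
    p99_latency_ms: float
    error_rate_pct: float
    saturation_pct: float


def breached_signals(s: GoldenSignals) -> list[str]:
    """Return the names of the signals that exceed the table's thresholds."""
    breaches = []
    if s.p99_latency_ms > 500:   # Latency: P99 > 500ms
        breaches.append("latency")
    if s.error_rate_pct > 1:     # Errors: 5xx rate > 1%
        breaches.append("errors")
    if s.saturation_pct > 80:    # Saturation: utilization > 80%
        breaches.append("saturation")
    return breaches


if __name__ == "__main__":
    reading = GoldenSignals(p99_latency_ms=620.0, error_rate_pct=0.4, saturation_pct=85.0)
    print(breached_signals(reading))  # ['latency', 'saturation']
```

In practice these checks usually live in alerting rules rather than application code; the sketch only makes the table's comparisons explicit.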

Templates

Template 1: Istio with Prometheus & Grafana

    # Prometheus scrape configuration
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus
      namespace: istio-system
    data:
      prometheus.yml: |
        global:
          scrape_interval: 15s
        scrape_configs:
          - job_name: 'istio-mesh'
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - istio-system
            relabel_configs:
              # Note: istio-telemetry is the Mixer-era telemetry service. Newer Istio
              # releases no longer ship it, so adjust the regex there (for example to istiod).
              - source_labels: [__meta_kubernetes_service_name]
                action: keep
                regex: istio-telemetry
    ---
    # ServiceMonitor for Prometheus Operator
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: istio-mesh
      namespace: istio-system
    spec:
      selector:
        matchLabels:
          app: istiod
      endpoints:
        - port: http-monitoring
          interval: 15s

Template 2: Key Istio Metrics Queries

    # Request rate by service
    sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

    # Error rate (5xx), as a percentage
    sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
      / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

    # P99 latency
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
      by (le, destination_service_name))

    # TCP connections
    sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

    # Request size
    histogram_quantile(0.99,
      sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
      by (le, destination_service_name))
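
The P99 queries above return an estimate, not an exact percentile: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified Python rendering of that idea, assuming cumulative `(upper_bound, count)` pairs whose last bucket is `+Inf`; the function name mirrors the PromQL one for readability, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes at least two buckets, cumulative counts,
    and a final bucket whose upper bound is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds.
    latency = [(100.0, 50.0), (250.0, 90.0), (500.0, 99.0), (float("inf"), 100.0)]
    print(round(histogram_quantile(0.95, latency), 1))  # 388.9
```

Because the estimate can never exceed the highest finite bucket bound, bucket boundaries placed near the alert threshold give more useful P99 values.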

Template 3: Jaeger Distributed Tracing

    # Jaeger installation for Istio
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    spec:
      meshConfig:
        enableTracing: true
        defaultConfig:
          tracing:
            sampling: 100.0 # 100% in dev, lower in prod
            zipkin:
              address: jaeger-collector.istio-system:9411
    ---
    # Jaeger deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jaeger
      namespace: istio-system
    spec:
      selector:
        matchLabels:
          app: jaeger
      template:
        metadata:
          labels:
            app: jaeger
        spec:
          containers:
            - name: jaeger
              image: jaegertracing/all-in-one:1.50
              ports:
                - containerPort: 5775 # UDP
                  protocol: UDP
                - containerPort: 6831 # Thrift compact (UDP)
                  protocol: UDP
                - containerPort: 6832 # Thrift binary (UDP)
                  protocol: UDP
                - containerPort: 5778 # Config
                - containerPort: 16686 # UI
                - containerPort: 14268 # HTTP
                - containerPort: 14250 # gRPC
                - containerPort: 9411 # Zipkin
              env:
                - name: COLLECTOR_ZIPKIN_HOST_PORT
                  value: ":9411"
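
Sidecars can start and report spans on their own, but they cannot link an inbound request to the outbound calls it triggers unless the application forwards the trace headers. The sketch below shows one way to do that, assuming the commonly documented B3 and W3C Trace Context header names; `propagate_trace_headers` is a hypothetical helper, and the exact header set should be checked against the mesh version in use.

```python
# Header names commonly documented for mesh tracing (B3 and W3C Trace Context).
# Treat this list as an assumption and verify it against your mesh's documentation.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Pick the trace headers out of an inbound request for reuse on outbound calls."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",
        "X-B3-SpanId": "e457b5a2e4d86bd1",
        "Content-Type": "application/json",
    }
    # Merge the result into the headers of every outbound request.
    print(propagate_trace_headers(inbound))
```

Tracing libraries such as the OpenTelemetry SDKs typically handle this automatically; a manual helper like this is mainly useful for services that are not instrumented.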

Template 4: Linkerd Viz Dashboard

    # Install Linkerd viz extension
    linkerd viz install | kubectl apply -f -

    # Access dashboard
    linkerd viz dashboard

    # CLI commands for observability

    # Top requests
    linkerd viz top deploy/my-app

    # Per-route metrics
    linkerd viz routes deploy/my-app --to deploy/backend

    # Live traffic inspection
    linkerd viz tap deploy/my-app --to deploy/backend

    # Service edges (dependencies)
    linkerd viz edges deployment -n my-namespace

Template 5: Grafana Dashboard JSON

    {
      "dashboard": {
        "title": "Service Mesh Overview",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
                "legendFormat": "{{destination_service_name}}"
              }
            ]
          },
          {
            "title": "Error Rate",
            "type": "gauge",
            "targets": [
              {
                "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "thresholds": {
                  "steps": [
                    {"value": 0, "color": "green"},
                    {"value": 1, "color": "yellow"},
                    {"value": 5, "color": "red"}
                  ]
                }
              }
            }
          },
          {
            "title": "P99 Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
                "legendFormat": "{{destination_service_name}}"
              }
            ]
          },
          {
            "title": "Service Topology",
            "type": "nodeGraph",
            "targets": [
              {
                "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
              }
            ]
          }
        ]
      }
    }

Template 6: Kiali Service Mesh Visualization

    # Kiali installation
    apiVersion: kiali.io/v1alpha1
    kind: Kiali
    metadata:
      name: kiali
      namespace: istio-system
    spec:
      auth:
        strategy: anonymous # or openid, token
      deployment:
        accessible_namespaces:
          - "**"
      external_services:
        prometheus:
          url: http://prometheus.istio-system:9090
        tracing:
          url: http://jaeger-query.istio-system:16686
        grafana:
          url: http://grafana.istio-system:3000

Template 7: OpenTelemetry Integration

    # OpenTelemetry Collector for mesh
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: otel-collector-config
    data:
      config.yaml: |
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: 0.0.0.0:4317
              http:
                endpoint: 0.0.0.0:4318
          zipkin:
            endpoint: 0.0.0.0:9411

        processors:
          batch:
            timeout: 10s

        exporters:
          # Note: recent Collector releases removed the dedicated jaeger exporter;
          # there, an otlp exporter pointed at Jaeger is the usual replacement.
          jaeger:
            endpoint: jaeger-collector:14250
            tls:
              insecure: true
          prometheus:
            endpoint: 0.0.0.0:8889

        service:
          pipelines:
            traces:
              receivers: [otlp, zipkin]
              processors: [batch]
              exporters: [jaeger]
            metrics:
              receivers: [otlp]
              processors: [batch]
              exporters: [prometheus]
    ---
    # Istio Telemetry v2 with OTel
    apiVersion: telemetry.istio.io/v1alpha1
    kind: Telemetry
    metadata:
      name: mesh-default
      namespace: istio-system
    spec:
      tracing:
        - providers:
            # "otel" must match a provider defined under meshConfig.extensionProviders
            - name: otel
          randomSamplingPercentage: 10
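
A sampling percentage such as the 10% above means roughly one trace in ten is kept. The sketch below illustrates one deterministic way to make that decision from the trace ID, so that any component applying the same rule reaches the same verdict. It is an illustration only, not how Istio or Envoy implement `randomSamplingPercentage`, and `should_sample` is a hypothetical helper.

```python
def should_sample(trace_id: str, percentage: float) -> bool:
    """Decide whether to keep a trace, based only on its hex trace ID."""
    # Map the low 32 bits of the trace ID onto 10,000 slots (0.01% resolution).
    slot = int(trace_id[-8:], 16) % 10_000
    return slot < percentage * 100


if __name__ == "__main__":
    # Sequential IDs stand in for real trace IDs so the result is reproducible.
    trace_ids = [f"{i:032x}" for i in range(10_000)]
    kept = sum(should_sample(t, 10) for t in trace_ids)
    print(f"kept {kept} of {len(trace_ids)} traces")  # kept 1000 of 10000 traces
```

In a mesh, the decision is normally made once at the first hop and carried to later hops in the trace headers, which is another reason to propagate them consistently.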

Alerting Rules

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: mesh-alerts
      namespace: istio-system
    spec:
      groups:
        - name: mesh.rules
          rules:
            - alert: HighErrorRate
              expr: |
                sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
                / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "High error rate for {{ $labels.destination_service_name }}"

            - alert: HighLatency
              expr: |
                histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
                by (le, destination_service_name)) > 1000
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High P99 latency for {{ $labels.destination_service_name }}"

            - alert: MeshCertExpiring
              expr: |
                (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
              labels:
                severity: warning
              annotations:
                summary: "Mesh certificate expiring in less than 7 days"
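
The `for: 5m` clause on the first two rules means the alert fires only after the expression has stayed true across evaluations spanning five minutes; a single evaluation below the threshold resets the wait. The sketch below models that behaviour on a list of `(timestamp, value)` evaluations. It is a simplified illustration with a hypothetical function name, not Prometheus's actual rule evaluator.

```python
from typing import Optional


def first_firing_time(
    samples: list[tuple[int, float]], threshold: float, for_seconds: int
) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None if it never does."""
    active_since: Optional[int] = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition became true: alert is pending
            if ts - active_since >= for_seconds:
                return ts  # condition held for the full window: alert fires
        else:
            active_since = None  # condition cleared: the pending state resets
    return None


if __name__ == "__main__":
    # Error ratio evaluated every 60 seconds, with one healthy evaluation at t=180.
    ratios = [(t, 0.01 if t == 180 else 0.08) for t in range(0, 601, 60)]
    print(first_firing_time(ratios, threshold=0.05, for_seconds=300))  # 540
```

The healthy evaluation at 180 seconds restarts the five-minute wait at 240 seconds, which is why the alert fires at 540 rather than 300.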

Best Practices

Do's


  • Sample appropriately - 100% in dev, 1-10% in prod

  • Use trace context - Propagate headers consistently

  • Set up alerts - For golden signals

  • Correlate metrics/traces - Use exemplars

  • Retain strategically - Hot/cold storage tiers

Don'ts


  • Don't over-sample - Storage costs add up

  • Don't ignore cardinality - Limit label values

  • Don't skip dashboards - Visualize dependencies

  • Don't forget costs - Monitor observability costs

Resources

  • Istio Observability

  • Linkerd Observability

  • OpenTelemetry

  • Kiali
