distributed-tracing

Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.


Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

Do not use this skill when

  • The task is unrelated to distributed tracing

  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.

  • Apply relevant best practices and validate outcomes.

  • Provide actionable steps and verification.

  • If detailed examples are required, open resources/implementation-playbook.md.

Purpose

    Track requests across distributed systems to understand latency, dependencies, and failure points.

Use this skill when

  • Debugging latency issues

  • Understanding service dependencies

  • Identifying bottlenecks

  • Tracing error propagation

  • Analyzing request paths

Distributed Tracing Concepts

Trace Structure

    Trace (Request ID: abc123)
    └─ Span (frontend) [100ms]
       └─ Span (api-gateway) [80ms]
          ├─ Span (auth-service) [10ms]
          └─ Span (user-service) [60ms]
             └─ Span (database) [40ms]

    Key Components


  • Trace - End-to-end request journey

  • Span - Single operation within a trace

  • Context - Metadata propagated between services

  • Tags - Key-value pairs for filtering

  • Logs - Timestamped events within a span
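
A minimal sketch of these concepts in OpenTelemetry Python (the service name, attribute, and event are illustrative):

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-service")

    # A span is one timed operation; nesting spans builds the trace tree
    with tracer.start_as_current_span("place_order") as span:
        # Tags (attributes): key-value pairs for filtering and search
        span.set_attribute("order.id", "ord-42")
        # Logs (span events): timestamped events recorded on the span
        span.add_event("payment_authorized", {"amount.usd": 19.99})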

Jaeger Setup

Kubernetes Deployment

    # Deploy the Jaeger Operator
    kubectl create namespace observability
    kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

Deploy a Jaeger instance:

    kubectl apply -f - <<EOF
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: jaeger
      namespace: observability
    spec:
      strategy: production
      storage:
        type: elasticsearch
        options:
          es:
            server-urls: http://elasticsearch:9200
      ingress:
        enabled: true
    EOF
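
To verify the rollout, check the pods and port-forward the query service (the operator typically names services <instance>-query and <instance>-collector, so jaeger-query here is an assumption based on the instance name above):

    kubectl get pods -n observability
    kubectl port-forward svc/jaeger-query 16686:16686 -n observability
    # UI is then available at http://localhost:16686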

Docker Compose

    version: '3.8'
    services:
      jaeger:
        image: jaegertracing/all-in-one:latest
        ports:
          - "5775:5775/udp"   # zipkin.thrift (legacy)
          - "6831:6831/udp"   # jaeger.thrift compact
          - "6832:6832/udp"   # jaeger.thrift binary
          - "5778:5778"       # agent configs
          - "16686:16686"     # UI
          - "14268:14268"     # Collector (HTTP)
          - "14250:14250"     # Collector (gRPC)
          - "9411:9411"       # Zipkin
        environment:
          - COLLECTOR_ZIPKIN_HOST_PORT=:9411
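
Start it and the UI is served on the mapped port:

    docker compose up -d
    # Jaeger UI: http://localhost:16686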

    Reference: See references/jaeger-setup.md

Application Instrumentation

OpenTelemetry (Recommended)

Python (Flask)

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.instrumentation.flask import FlaskInstrumentor

    # Initialize the tracer
    resource = Resource(attributes={SERVICE_NAME: "my-service"})
    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(JaegerExporter(
        agent_host_name="jaeger",
        agent_port=6831,
    ))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    # Instrument Flask
    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)

    @app.route('/api/users')
    def get_users():
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span("get_users") as span:
            span.set_attribute("user.count", 100)
            # Business logic
            users = fetch_users_from_db()
            return {"users": users}

    def fetch_users_from_db():
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span("database_query") as span:
            span.set_attribute("db.system", "postgresql")
            span.set_attribute("db.statement", "SELECT * FROM users")
            # Database query
            return query_database()

Node.js (Express)

    const { trace } = require('@opentelemetry/api');
    const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
    const { Resource } = require('@opentelemetry/resources');
    const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
    const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
    const { registerInstrumentations } = require('@opentelemetry/instrumentation');
    const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
    const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

    // Initialize the tracer
    const provider = new NodeTracerProvider({
      resource: new Resource({ 'service.name': 'my-service' }),
    });

    const exporter = new JaegerExporter({
      endpoint: 'http://jaeger:14268/api/traces',
    });

    provider.addSpanProcessor(new BatchSpanProcessor(exporter));
    provider.register();

    // Instrument libraries
    registerInstrumentations({
      instrumentations: [
        new HttpInstrumentation(),
        new ExpressInstrumentation(),
      ],
    });

    const express = require('express');
    const app = express();

    app.get('/api/users', async (req, res) => {
      const tracer = trace.getTracer('my-service');
      const span = tracer.startSpan('get_users');
      try {
        const users = await fetchUsers();
        span.setAttributes({ 'user.count': users.length });
        res.json({ users });
      } finally {
        span.end();
      }
    });

Go

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/jaeger"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
    )

    func initTracer() (*sdktrace.TracerProvider, error) {
        exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
            jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
        ))
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exporter),
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceNameKey.String("my-service"),
            )),
        )

        otel.SetTracerProvider(tp)
        return tp, nil
    }

    func getUsers(ctx context.Context) ([]User, error) {
        tracer := otel.Tracer("my-service")
        ctx, span := tracer.Start(ctx, "get_users")
        defer span.End()

        span.SetAttributes(attribute.String("user.filter", "active"))

        users, err := fetchUsersFromDB(ctx)
        if err != nil {
            span.RecordError(err)
            return nil, err
        }

        span.SetAttributes(attribute.Int("user.count", len(users)))
        return users, nil
    }

    Reference: See references/instrumentation.md

Context Propagation

HTTP Headers

Trace context crosses service boundaries via the W3C Trace Context headers; traceparent packs the version, trace ID, parent span ID, and trace flags into one value:

    traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
    tracestate: congo=t61rcWkgMzE

Propagation in HTTP Requests

Python

    import requests
    from opentelemetry.propagate import inject

    headers = {}
    inject(headers)  # Injects traceparent/tracestate into the dict

    response = requests.get('http://downstream-service/api', headers=headers)

Node.js

    const axios = require('axios');
    const { context, propagation } = require('@opentelemetry/api');

    const headers = {};
    propagation.inject(context.active(), headers);

    axios.get('http://downstream-service/api', { headers });
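
On the receiving side, the downstream service extracts the incoming context before starting spans, so they join the caller's trace. Auto-instrumentation does this for you; a manual Flask sketch (route and names illustrative):

    from flask import request
    from opentelemetry import trace
    from opentelemetry.propagate import extract

    tracer = trace.get_tracer(__name__)

    @app.route('/api')
    def handle():
        # Rebuild the caller's context from the incoming headers
        ctx = extract(request.headers)
        with tracer.start_as_current_span("handle_api", context=ctx):
            return {"ok": True}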

Tempo Setup (Grafana)

Kubernetes Deployment

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tempo-config
    data:
      tempo.yaml: |
        server:
          http_listen_port: 3200

        distributor:
          receivers:
            jaeger:
              protocols:
                thrift_http:
                grpc:
            otlp:
              protocols:
                http:
                grpc:

        storage:
          trace:
            backend: s3
            s3:
              bucket: tempo-traces
              endpoint: s3.amazonaws.com

        querier:
          frontend_worker:
            frontend_address: tempo-query-frontend:9095
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tempo
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: tempo
      template:
        metadata:
          labels:
            app: tempo
        spec:
          containers:
            - name: tempo
              image: grafana/tempo:latest
              args:
                - -config.file=/etc/tempo/tempo.yaml
              volumeMounts:
                - name: config
                  mountPath: /etc/tempo
          volumes:
            - name: config
              configMap:
                name: tempo-config
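
Because the receivers block above enables the Jaeger thrift HTTP protocol, apps already exporting to Jaeger can point at Tempo without code changes (a sketch; the tempo host name is an assumption about your service naming):

    from opentelemetry.exporter.jaeger.thrift import JaegerExporter

    # Tempo's jaeger thrift_http receiver listens on the standard
    # collector port, so only the endpoint changes
    exporter = JaegerExporter(collector_endpoint="http://tempo:14268/api/traces")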

    Reference: See assets/jaeger-config.yaml.template

Sampling Strategies

Probabilistic Sampling

    # Sample 1% of traces
    sampler:
      type: probabilistic
      param: 0.01

Rate-Limiting Sampling

    # Sample at most 100 traces per second
    sampler:
      type: ratelimiting
      param: 100

Parent-Based Sampling (OpenTelemetry SDK)

    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Deterministic head sampling keyed on the trace ID; child spans
    # follow their parent's sampling decision
    sampler = ParentBased(root=TraceIdRatioBased(0.01))
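
The sampler only takes effect when attached to the tracer provider at startup; a minimal sketch reusing the provider setup shown earlier:

    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Pass the sampler at construction time; it cannot be swapped
    # once spans are being created
    provider = TracerProvider(
        sampler=ParentBased(root=TraceIdRatioBased(0.01)),
    )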

Trace Analysis

Finding Slow Requests

Jaeger UI query (service plus a minimum duration):

    service=my-service
    minDuration=1s

Finding Errors

Jaeger tag search matches exact key=value pairs:

    service=my-service
    tags: error=true http.status_code=500
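
These searches can also be scripted against the Jaeger query service's HTTP API, the same endpoint the UI calls (unofficial and subject to change; host and port assume the all-in-one defaults):

    import requests

    # Find up to 20 traces for my-service slower than 1s
    resp = requests.get(
        "http://jaeger:16686/api/traces",
        params={"service": "my-service", "minDuration": "1s", "limit": 20},
    )
    for t in resp.json()["data"]:
        slowest = max(s["duration"] for s in t["spans"])  # microseconds
        print(t["traceID"], slowest)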

    Service Dependency Graph

    Jaeger automatically generates service dependency graphs showing:

  • Service relationships

  • Request rates

  • Error rates

  • Average latencies

Best Practices

  • Sample appropriately (1-10% in production)

  • Add meaningful tags (user_id, request_id)

  • Propagate context across all service boundaries

  • Log exceptions in spans

  • Use consistent naming for operations

  • Monitor tracing overhead (<1% CPU impact)

  • Set up alerts for trace errors

  • Implement distributed context with baggage (see the sketch after this list)

  • Use span events for important milestones

  • Document instrumentation standards
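
A short sketch of the baggage and span-event practices above in Python (keys and values are illustrative):

    from opentelemetry import baggage, context, trace

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("checkout") as span:
        # Baggage: request-scoped key-values propagated across services
        ctx = baggage.set_baggage("user.id", "u-123")
        token = context.attach(ctx)
        try:
            # Span events: timestamped milestones within the span
            span.add_event("cart_validated", {"item.count": 3})
        finally:
            context.detach(token)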

Integration with Logging

Correlated Logs

    import logging
    from opentelemetry import trace

    logger = logging.getLogger(__name__)

    def process_request():
        span = trace.get_current_span()
        trace_id = span.get_span_context().trace_id

        logger.info(
            "Processing request",
            extra={"trace_id": format(trace_id, '032x')},
        )
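
To avoid wiring this by hand, the opentelemetry-instrumentation-logging package can inject IDs into every log record (a sketch, assuming that package is installed):

    from opentelemetry.instrumentation.logging import LoggingInstrumentor

    # Patches the logging module so records carry otelTraceID/otelSpanID
    # and updates the root handler's format to include them
    LoggingInstrumentor().instrument(set_logging_format=True)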

    Troubleshooting

    No traces appearing:

  • Check collector endpoint

  • Verify network connectivity

  • Check sampling configuration

  • Review application logs
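
For the no-traces case, a quick way to isolate the problem is to swap in a console exporter: if spans print locally, instrumentation works and the issue is export or networking (Python sketch, reusing the provider from the instrumentation section):

    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Print every finished span to stdout instead of shipping it
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))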

High latency overhead:

  • Reduce sampling rate

  • Use batch span processor

  • Check exporter configuration

Reference Files

  • references/jaeger-setup.md - Jaeger installation

  • references/instrumentation.md - Instrumentation patterns

  • assets/jaeger-config.yaml.template - Jaeger configuration

Related Skills

  • prometheus-configuration - For metrics

  • grafana-dashboards - For visualization

  • slo-implementation - For latency SLOs