prometheus-configuration

Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.

View Source
name:prometheus-configurationdescription:Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.

Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

Do not use this skill when

  • The task is unrelated to prometheus configuration

  • You need a different domain or tool outside this scope
  • Instructions

  • Clarify goals, constraints, and required inputs.

  • Apply relevant best practices and validate outcomes.

  • Provide actionable steps and verification.

  • If detailed examples are required, open resources/implementation-playbook.md.
  • Purpose

    Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

    Use this skill when

  • Set up Prometheus monitoring

  • Configure metric scraping

  • Create recording rules

  • Design alert rules

  • Implement service discovery
  • Prometheus Architecture

    ┌──────────────┐
    │ Applications │ ← Instrumented with client libraries
    └──────┬───────┘
    │ /metrics endpoint

    ┌──────────────┐
    │ Prometheus │ ← Scrapes metrics periodically
    │ Server │
    └──────┬───────┘

    ├─→ AlertManager (alerts)
    ├─→ Grafana (visualization)
    └─→ Long-term storage (Thanos/Cortex)

    Installation

    Kubernetes with Helm

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --set prometheus.prometheusSpec.retention=30d \
    --set prometheus.prometheusSpec.storageVolumeSize=50Gi

    Docker Compose

    version: '3.8'
    services:
    prometheus:
    image: prom/prometheus:latest
    ports:
    - "9090:9090"
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus-data:/prometheus
    command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.path=/prometheus'
    - '--storage.tsdb.retention.time=30d'

    volumes:
    prometheus-data:

    Configuration File

    prometheus.yml:

    global:
    scrape_interval: 15s
    evaluation_interval: 15s
    external_labels:
    cluster: 'production'
    region: 'us-west-2'

    Alertmanager configuration


    alerting:
    alertmanagers:
    - static_configs:
    - targets:
    - alertmanager:9093

    Load rules files


    rule_files:
    - /etc/prometheus/rules/.yml

    Scrape configurations


    scrape_configs:
    # Prometheus itself
    - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

    # Node exporters
    - job_name: 'node-exporter'
    static_configs:
    - targets:
    - 'node1:9100'
    - 'node2:9100'
    - 'node3:9100'
    relabel_configs:
    - source_labels: [__address__]
    target_label: instance
    regex: '([^:]+)(:[0-9]+)?'
    replacement: '${1}'

    # Kubernetes pods with annotations
    - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
    - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod

    # Application metrics
    - job_name: 'my-app'
    static_configs:
    - targets:
    - 'app1.example.com:9090'
    - 'app2.example.com:9090'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
    ca_file: /etc/prometheus/ca.crt
    cert_file: /etc/prometheus/client.crt
    key_file: /etc/prometheus/client.key

    Reference: See assets/prometheus.yml.template

    Scrape Configurations

    Static Targets

    scrape_configs:
    - job_name: 'static-targets'
    static_configs:
    - targets: ['host1:9100', 'host2:9100']
    labels:
    env: 'production'
    region: 'us-west-2'

    File-based Service Discovery

    scrape_configs:
    - job_name: 'file-sd'
    file_sd_configs:
    - files:
    - /etc/prometheus/targets/
    .json
    - /etc/prometheus/targets/.yml
    refresh_interval: 5m

    targets/production.json:

    [
    {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
    "env": "production",
    "service": "api"
    }
    }
    ]

    Kubernetes Service Discovery

    scrape_configs:
    - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
    - role: service
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)

    Reference: See references/scrape-configs.md

    Recording Rules

    Create pre-computed metrics for frequently queried expressions:

    # /etc/prometheus/rules/recording_rules.yml
    groups:
    - name: api_metrics
    interval: 15s
    rules:
    # HTTP request rate per service
    - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))

    # Error rate percentage
    - record: job:http_requests_errors:rate5m
    expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

    - record: job:http_requests_error_rate:percentage
    expr: |
    (job:http_requests_errors:rate5m / job:http_requests:rate5m)
    100

    # P95 latency
    - record: job:http_request_duration:p95
    expr: |
    histogram_quantile(0.95,
    sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
    )

    - name: resource_metrics
    interval: 30s
    rules:
    # CPU utilization percentage
    - record: instance:node_cpu:utilization
    expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) 100)

    # Memory utilization percentage
    - record: instance:node_memory:utilization
    expr: |
    100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
    100)

    # Disk usage percentage
    - record: instance:node_disk:utilization
    expr: |
    100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) 100)

    Reference: See references/recording-rules.md

    Alert Rules

    # /etc/prometheus/rules/alert_rules.yml
    groups:
    - name: availability
    interval: 30s
    rules:
    - alert: ServiceDown
    expr: up{job="my-app"} == 0
    for: 1m
    labels:
    severity: critical
    annotations:
    summary: "Service {{ $labels.instance }} is down"
    description: "{{ $labels.job }} has been down for more than 1 minute"

    - alert: HighErrorRate
    expr: job:http_requests_error_rate:percentage > 5
    for: 5m
    labels:
    severity: warning
    annotations:
    summary: "High error rate for {{ $labels.job }}"
    description: "Error rate is {{ $value }}% (threshold: 5%)"

    - alert: HighLatency
    expr: job:http_request_duration:p95 > 1
    for: 5m
    labels:
    severity: warning
    annotations:
    summary: "High latency for {{ $labels.job }}"
    description: "P95 latency is {{ $value }}s (threshold: 1s)"

    - name: resources
    interval: 1m
    rules:
    - alert: HighCPUUsage
    expr: instance:node_cpu:utilization > 80
    for: 5m
    labels:
    severity: warning
    annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}%"

    - alert: HighMemoryUsage
    expr: instance:node_memory:utilization > 85
    for: 5m
    labels:
    severity: warning
    annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value }}%"

    - alert: DiskSpaceLow
    expr: instance:node_disk:utilization > 90
    for: 5m
    labels:
    severity: critical
    annotations:
    summary: "Low disk space on {{ $labels.instance }}"
    description: "Disk usage is {{ $value }}%"

    Validation

    # Validate configuration
    promtool check config prometheus.yml

    Validate rules


    promtool check rules /etc/prometheus/rules/
    .yml

    Test query


    promtool query instant http://localhost:9090 'up'

    Reference: See scripts/validate-prometheus.sh

    Best Practices

  • Use consistent naming for metrics (prefix_name_unit)

  • Set appropriate scrape intervals (15-60s typical)

  • Use recording rules for expensive queries

  • Implement high availability (multiple Prometheus instances)

  • Configure retention based on storage capacity

  • Use relabeling for metric cleanup

  • Monitor Prometheus itself

  • Implement federation for large deployments

  • Use Thanos/Cortex for long-term storage

  • Document custom metrics
  • Troubleshooting

    Check scrape targets:

    curl http://localhost:9090/api/v1/targets

    Check configuration:

    curl http://localhost:9090/api/v1/status/config

    Test query:

    curl 'http://localhost:9090/api/v1/query?query=up'

    Reference Files

  • assets/prometheus.yml.template - Complete configuration template

  • references/scrape-configs.md - Scrape configuration patterns

  • references/recording-rules.md - Recording rule examples

  • scripts/validate-prometheus.sh - Validation script
  • Related Skills

  • grafana-dashboards - For visualization

  • slo-implementation - For SLO monitoring

  • distributed-tracing - For request tracing

    1. prometheus-configuration - Agent Skills