slo-implementation
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Category: Business Analysis
SLO Implementation - Service Level Objective Implementation Guide
Skill Overview
SLO Implementation provides a framework for implementing service level objectives (SLOs): defining SLI metrics, setting reliability targets, calculating error budgets, and configuring burn-rate alerting rules, so that SRE reliability decisions are driven by data.
Use Cases
1. Establish Service Reliability Objectives
When you need to set quantifiable reliability goals for microservices, APIs, or applications, this skill offers guidance across the full SLI/SLO/SLA hierarchy. It helps you choose appropriate availability, latency, and durability indicators, and set realistic SLO target values (e.g., 99%, 99.9%, 99.95%) based on business needs.
2. Implement Error Budget Strategy
When you want to balance release speed with system stability, this skill provides error budget calculation formulas and strategy configuration. It helps you define action plans tied to how much budget remains (e.g., normal development while the budget is healthy, delaying changes as it runs low, a feature freeze once it is exhausted), enabling data-driven release decisions.
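A minimal sketch of such a policy in Python. The three actions are the ones named above; the function name and the single `caution_below` threshold (25% here) are hypothetical and not part of the skill, since where the boundaries sit is a team decision.

```python
def budget_policy(remaining_pct: float, caution_below: float = 25.0) -> str:
    """Map remaining error budget (in percent) to a release policy.

    caution_below is an illustrative threshold, not a value from the skill.
    """
    if remaining_pct <= 0:
        # Budget exhausted or overspent: stop shipping features.
        return "feature freeze"
    if remaining_pct < caution_below:
        # Budget running low: slow down risky changes.
        return "delay changes"
    return "normal development"


print(budget_policy(80.0))   # normal development
print(budget_policy(10.0))   # delay changes
print(budget_policy(-5.0))   # feature freeze
```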
3. Configure SLO Monitoring and Alerts
When you need to set intelligent alerts based on SLOs, this skill provides Prometheus recording rule templates and alert rule templates. These include multi-window burn-rate detection (fast burn 14.4x, slow burn 6x), which effectively reduces false positives and provides early warnings of reliability issues.
Core Features
1. SLI Metric Definitions
Includes PromQL query templates for common SLI types such as availability, latency, and durability. It supports 28-day rolling window calculations to accurately measure real user-perceived service quality. Out-of-the-box metric definitions include HTTP request success rate, P95 latency compliance rate, storage write success rate, and more.
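The skill expresses these SLIs as PromQL queries. The ratios behind an availability SLI and a latency compliance SLI can be sketched in plain Python as follows. The function names, the sample numbers, and the 500 ms threshold are illustrative assumptions, not taken from the skill.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded over the measurement window."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    return good_requests / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the latency threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


# 9,990 successful requests out of 10,000 gives an availability of 0.999.
print(availability_sli(9_990, 10_000))

# Three of four requests finished within 500 ms, so compliance is 0.75.
print(latency_sli([100.0, 200.0, 600.0, 300.0], threshold_ms=500.0))
```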
2. Error Budget Management
Automatically calculates the remaining error budget as a percentage (formula: (SLI - SLO target) / (1 - SLO target) * 100, with SLI and SLO target expressed as fractions such as 0.999), tracks the budget consumption rate in real time, and supports burn-rate analysis across multiple time windows (5 minutes, 1 hour, 6 hours). This gives you clear visibility into how many errors you can still tolerate; a negative result means the budget is already overspent.
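The formula above can be checked with a short Python sketch. The function name and the sample SLI values are assumptions for illustration; SLI and SLO target are passed as fractions.

```python
def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Remaining error budget in percent: (SLI - SLO) / (1 - SLO) * 100."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (sli - slo_target) / (1 - slo_target) * 100


# Measured 99.95% against a 99.9% target: roughly half the budget is left.
print(round(error_budget_remaining(0.9995, 0.999), 6))

# Measured 99.8% against a 99.9% target: the result is negative,
# meaning the budget has been overspent.
print(round(error_budget_remaining(0.998, 0.999), 6))
```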
3. SLO Alerts and Visualization
Provides Grafana dashboard structure templates and PromQL query examples to visualize SLO compliance, remaining error budget, SLI trends, and burn-rate analysis. Combined with multi-window alerting rules, alerts fire while the error budget is being consumed rapidly, giving you time to act before the SLO is violated.
Frequently Asked Questions
What are the differences between SLO and SLA?
An SLA (Service Level Agreement) is an external commercial commitment signed with customers. An SLO (Service Level Objective) is an internally set reliability target used to drive engineering practices. An SLI (Service Level Indicator) is the metric value measured in practice. Together, they form a hierarchy of “customer commitment → internal goals → actual measurement.”
How do I choose the right SLO target value?
Choosing an SLO means weighing user expectations, business requirements, current performance, the cost of reliability, and competitor benchmarks. Common targets are 99% (7.2 hours of allowed downtime per 30-day month), 99.9% (43.2 minutes), and 99.95% (21.6 minutes). Avoid aiming for 100% availability, since the cost of approaching it grows without bound and it leaves no error budget for shipping changes. It’s recommended to start with a value slightly higher than your current performance and adjust gradually.
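The downtime figures quoted above follow from multiplying the window length by (1 - SLO target). A small Python sketch, assuming a 30-day month; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)


for target in (0.99, 0.999, 0.9995):
    minutes = allowed_downtime_minutes(target)
    # Approximately 432 min (7.2 h), 43.2 min, and 21.6 min respectively.
    print(f"{target:.2%}: {minutes:.1f} minutes")
```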
What is a multi-window burn-rate alert?
Multi-window alerting pairs each long burn-rate window with a much shorter one and fires only when both exceed the same threshold, which reduces false positives. For example, a fast-burn alert requires that the burn rate in both the 1-hour and 5-minute windows exceed 14.4x (at that rate, 2% of a 30-day error budget is consumed within 1 hour). A slow-burn alert requires that the burn rate in both the 6-hour and 30-minute windows exceed 6x (5% of a 30-day budget within 6 hours). The long window filters out transient fluctuations, and the short window confirms that the problem is still happening.
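The fast-burn and slow-burn conditions can be sketched in Python as below. In practice these checks would live in Prometheus alert rules; the function names and sample error ratios here are assumptions for illustration, and the budget arithmetic assumes a 30-day SLO period.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo_target)


def should_alert(long_window_errors: float, short_window_errors: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if BOTH windows burn faster than the threshold."""
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent at this rate over the window."""
    return rate * window_hours / (period_days * 24)


SLO = 0.999

# Fast burn: 2% errors over both 1 hour and 5 minutes is a ~20x burn rate.
print(should_alert(0.02, 0.02, SLO, threshold=14.4))    # True

# A spike that already recovered: the 5-minute window is healthy, so no alert.
print(should_alert(0.02, 0.0005, SLO, threshold=14.4))  # False

# 14.4x for 1 hour spends about 2% of a 30-day budget; 6x for 6 hours about 5%.
print(round(budget_consumed(14.4, 1), 4), round(budget_consumed(6, 6), 4))
```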