@anton-abyzov/observability
anton-abyzov
107
10 forks
Updated 3/31/2026

Observability Engineer - Full-Stack Monitoring Expert: Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.

Installation

$ npx agent-skills-cli install @anton-abyzov/observability
Claude Code
Cursor
Copilot
Codex
Antigravity

Details

Path: plugins/specweave-infrastructure/skills/observability/SKILL.md
Branch: develop
Scoped Name: @anton-abyzov/observability

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


description: Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.
allowed-tools: Read, Write, Edit, Bash
model: opus
context: fork

Observability Engineer - Full-Stack Monitoring Expert

⚠️ Chunking Rule

Large monitoring stacks (Prometheus + Grafana + OpenTelemetry + logs) can exceed 1000 lines of configuration. Generate ONE component per response, in this order: Metrics → Dashboards → Alerting → Tracing → Logs.

Purpose

Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.

When to Use

  • Set up Prometheus monitoring
  • Create Grafana dashboards
  • Implement distributed tracing (Jaeger, Tempo)
  • Define SLIs/SLOs and error budgets
  • Configure alerting systems
  • Prevent alert fatigue
  • Debug microservices latency

Scope Boundaries

This skill covers OBSERVABILITY STRATEGY: SLIs/SLOs, error budgets, dashboards, alerting design.

  • For OpenTelemetry instrumentation details → use /sw-infra:opentelemetry

Core Concepts

Three Pillars of Observability

┌─────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│    METRICS      │     LOGS        │        TRACES           │
├─────────────────┼─────────────────┼─────────────────────────┤
│ Prometheus      │ Loki/ELK        │ Jaeger/Tempo            │
│ What happened?  │ Why happened?   │ How requests flow?      │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘

RED Method (Services)

  • Rate - Requests per second
  • Errors - Error rate percentage
  • Duration - Latency/response time
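
As a rough illustration of how the three RED signals can be derived from two readings of cumulative counters, here is a minimal JavaScript sketch. The snapshot shape (`timestampSeconds`, `requests`, `errors`, `durationSumSeconds`) and the function name `redFromSnapshots` are assumptions made for this example and do not come from any library. Duration is reduced to a mean here for brevity; in practice percentiles from a histogram are usually more informative.

```javascript
// Hypothetical snapshot shape: cumulative counters read at a point in time.
// { timestampSeconds, requests, errors, durationSumSeconds }
function redFromSnapshots(prev, curr) {
  const elapsed = curr.timestampSeconds - prev.timestampSeconds;
  const requests = curr.requests - prev.requests;
  if (elapsed <= 0 || requests < 0) {
    throw new RangeError('snapshots must be ordered and counters must not reset');
  }
  const errors = curr.errors - prev.errors;
  const durationSum = curr.durationSumSeconds - prev.durationSumSeconds;
  return {
    // Rate: requests per second over the interval.
    ratePerSecond: requests / elapsed,
    // Errors: share of requests that failed, as a percentage.
    errorPercent: requests === 0 ? 0 : (errors / requests) * 100,
    // Duration: mean latency per request (a simplification of a latency distribution).
    meanDurationSeconds: requests === 0 ? 0 : durationSum / requests,
  };
}

const red = redFromSnapshots(
  { timestampSeconds: 0, requests: 1000, errors: 10, durationSumSeconds: 120 },
  { timestampSeconds: 60, requests: 1600, errors: 16, durationSumSeconds: 210 },
);
console.log(red); // roughly 10 req/s, 1% errors, 0.15 s mean duration
```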

USE Method (Resources)

  • Utilization - % time resource is busy
  • Saturation - Queue length/wait time
  • Errors - Error count

Prometheus Setup

Installation (Kubernetes)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d

Key Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Recording Rules

groups:
  - name: api_metrics
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_requests_error_rate:percentage
        expr: (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))) * 100

Grafana Dashboards

Dashboard Design Principles

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

Essential Queries

# Request rate
sum(rate(http_requests_total[5m])) by (service)

# Error rate %
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
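
To make the P95 query less of a black box, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the assumption the Prometheus documentation describes for `histogram_quantile`. It is a simplified illustration, not Prometheus's implementation: the function name `estimateQuantile` and the bucket shape are hypothetical, and edge cases such as negative bounds or native histograms are ignored.

```javascript
// buckets: cumulative counts sorted by upper bound `le`; the last bucket is +Inf.
function estimateQuantile(q, buckets) {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const upper = buckets[i].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 94 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, buckets).toFixed(2)); // "0.60"
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: the 95th request lands in the 0.5s to 1s bucket, so the result can only be as precise as that bucket is narrow.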

Distributed Tracing

OpenTelemetry Setup (Node.js)

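const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
// BatchSpanProcessor is exported by the base tracing SDK.
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
// Note: @opentelemetry/exporter-jaeger is deprecated; newer setups typically
// send OTLP to Jaeger or Tempo instead.
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// Batch finished spans and ship them to Jaeger.
// addSpanProcessor() is the SDK 1.x API; SDK 2.x takes `spanProcessors` in the constructor.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new JaegerExporter()));
provider.register();

// Auto-instrument incoming and outgoing HTTP calls.
registerInstrumentations({
  instrumentations: [new HttpInstrumentation()],
});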

Context Propagation

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
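
The header above follows the W3C Trace Context layout `version-traceid-parentid-flags`. A minimal parsing sketch, assuming only the version `00` layout, is shown here; `parseTraceparent` is a hypothetical helper written for illustration, and in a real service the OpenTelemetry propagators would normally handle this.

```javascript
// Parse a W3C Trace Context `traceparent` header (version 00 layout only).
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId, // 16 bytes, shared by every span in the trace
    parentId, // 8 bytes, the caller's span ID
    sampled: (parseInt(flags, 16) & 0x01) === 1, // lowest flag bit = sampled
  };
}

console.log(parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'));
```

For the example header this yields trace ID `0af7651916cd43dd8448eb211c80319c`, parent span ID `b7ad6b7169203331`, and `sampled: true`.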

Jaeger Deployment

# Kubernetes
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch

SLIs/SLOs

Defining SLOs

slos:
  - name: api_availability
    target: 99.9%  # 40.3 min downtime per 28-day window (43.2 min per 30 days)
    window: 28d
    sli: sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99%    # 99% requests < 500ms
    window: 28d
    sli: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

Error Budget

Error Budget = 1 - SLO Target
Example: 99.9% SLO → 0.1% error budget → 43.2 min per 30-day month
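
The arithmetic can be checked with a short sketch. `errorBudgetMinutes` is a hypothetical helper name, and the SLO target is passed as a fraction (0.999 for 99.9%).

```javascript
// Allowed "bad" minutes for a given SLO target over a window of N days.
function errorBudgetMinutes(sloTarget, windowDays) {
  const budgetFraction = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return budgetFraction * windowDays * 24 * 60;
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // "43.2"
console.log(errorBudgetMinutes(0.999, 28).toFixed(1)); // "40.3"
```

The second call uses the 28-day window from the SLO definitions, which gives a slightly smaller budget than the 30-day figure.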

Burn Rate Alerts

rules:
  - alert: SLOErrorBudgetBurnFast
    expr: slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Fast error budget burn - consuming 2% budget in 1 hour"
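
The 14.4 threshold can be derived rather than memorized. The sketch below assumes a 30-day SLO window, which is what makes "2% of the budget in 1 hour" equal a burn rate of 14.4; the helper names `burnRate` and `burnRateThreshold` are hypothetical and not part of Prometheus.

```javascript
// Burn rate: how many times faster than "exactly on budget" errors are occurring.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

// Burn rate at which `budgetFraction` of the budget is spent within `alertHours`.
function burnRateThreshold(budgetFraction, alertHours, windowDays) {
  return (budgetFraction * windowDays * 24) / alertHours;
}

console.log(burnRateThreshold(0.02, 1, 30).toFixed(1)); // "14.4"
console.log(burnRate(0.0144, 0.999).toFixed(1)); // "14.4" (1.44% errors against a 99.9% SLO)
```

With a 28-day window the same 2%-per-hour target works out to 13.44, so the threshold is worth recomputing if the SLO window differs from 30 days.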

Alert Fatigue Prevention

Multi-Window Alerting

# Combine short + long windows to reduce false positives
- alert: HighLatency
  expr: |
    (job:http_request_duration:p95_5m > 1 and job:http_request_duration:p95_1h > 0.8)
  for: 5m

Severity Levels

Severity   Response           Examples
critical   Page immediately   Service down, data loss
warning    Review in hours    Degraded performance
info       Daily review       Capacity planning

Best Practices

  1. Start with RED/USE methods for consistent metrics
  2. Use recording rules for expensive queries
  3. Implement multi-window alerts to reduce noise
  4. Set achievable SLOs (don't aim for 100%)
  5. Track error budget consistently
  6. Correlate traces with metrics using trace IDs
  7. Sample traces appropriately (1-10% in production)
  8. Add context to spans (user_id, request_id)
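
For the trace sampling item, one common approach is a deterministic decision based on the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below illustrates the idea only; `shouldSample` is a hypothetical function and is not the algorithm used by OpenTelemetry's built-in ratio sampler.

```javascript
// Deterministic head sampling sketch: the same trace ID always gets the same decision.
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters (32 bits) of the trace ID as a pseudo-random value.
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace if that value falls in the lowest `ratio` share of the 32-bit range.
  return bucket < ratio * 0x100000000;
}

const traceId = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSample(traceId, 0.1)); // false
console.log(shouldSample(traceId, 0.2)); // true
```

This relies on trace IDs being randomly generated; IDs with structured or low-entropy suffixes would skew the effective sampling rate.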

Related Skills

  • devops - Infrastructure provisioning