observability

@anton-abyzov/observability

anton-abyzov

107

10 forks

Updated 3/31/2026


Observability Engineer - Full-Stack Monitoring Expert: Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.

Installation

$ npx agent-skills-cli install @anton-abyzov/observability

Claude Code | Cursor | Copilot | Codex | Antigravity

Details

Repository: anton-abyzov/specweave

Path: plugins/specweave-infrastructure/skills/observability/SKILL.md

Branch: develop

Scoped Name: @anton-abyzov/observability

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions

description: Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.
allowed-tools: Read, Write, Edit, Bash
model: opus
context: fork

Observability Engineer - Full-Stack Monitoring Expert

⚠️ Chunking Rule

Large monitoring stacks (Prometheus + Grafana + OpenTelemetry + logs) can exceed 1000 lines of configuration. Generate ONE component per response, in this order: Metrics → Dashboards → Alerting → Tracing → Logs.

Purpose

Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.

When to Use

Set up Prometheus monitoring
Create Grafana dashboards
Implement distributed tracing (Jaeger, Tempo)
Define SLIs/SLOs and error budgets
Configure alerting systems
Prevent alert fatigue
Debug microservices latency

Scope Boundaries

This skill covers OBSERVABILITY STRATEGY: SLIs/SLOs, error budgets, dashboards, alerting design.

For OpenTelemetry instrumentation details → use /sw-infra:opentelemetry

Core Concepts

Three Pillars of Observability

┌─────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│    METRICS      │     LOGS        │        TRACES           │
├─────────────────┼─────────────────┼─────────────────────────┤
│ Prometheus      │ Loki/ELK        │ Jaeger/Tempo            │
│ What happened?  │ Why it happened?│ How do requests flow?   │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘

RED Method (Services)

Rate - Requests per second
Errors - Error rate percentage
Duration - Latency/response time
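
As a rough illustration of the three RED signals, the sketch below derives them from an in-memory list of request records. The record shape (`status`, `durationMs`) and the helper name `redSummary` are assumptions made for this example; in a real stack these numbers would come from Prometheus queries rather than application code.

```javascript
// Hypothetical helper: summarize RED signals for the requests seen in one window.
// `requests` is an array of { status: number, durationMs: number }.
function redSummary(requests, windowSeconds) {
  const total = requests.length;
  const errors = requests.filter((r) => r.status >= 500).length;
  const totalDuration = requests.reduce((sum, r) => sum + r.durationMs, 0);
  return {
    ratePerSecond: total / windowSeconds,                     // Rate
    errorPercent: total === 0 ? 0 : (errors / total) * 100,   // Errors
    meanDurationMs: total === 0 ? 0 : totalDuration / total,  // Duration
  };
}

// Illustrative data only: four requests observed over a 2-second window.
const redSample = [
  { status: 200, durationMs: 100 },
  { status: 200, durationMs: 200 },
  { status: 500, durationMs: 300 },
  { status: 200, durationMs: 200 },
];
console.log(redSummary(redSample, 2));
```

The mean is used for Duration only to keep the sketch short; percentiles are usually a better fit for latency.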

USE Method (Resources)

Utilization - % time resource is busy
Saturation - Queue length/wait time
Errors - Error count
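
The USE signals can be sketched the same way for a single resource. The sample shape (`busy`, `queueLength`, `errors`) and the name `useSummary` are hypothetical; the example assumes the resource is polled at a fixed interval, so the share of busy samples approximates the share of busy time.

```javascript
// Hypothetical helper: summarize USE signals from periodic samples of one resource.
// Each sample is { busy: boolean, queueLength: number, errors: number }.
function useSummary(samples) {
  if (samples.length === 0) {
    return { utilizationPercent: 0, avgQueueLength: 0, errorCount: 0 };
  }
  const busy = samples.filter((s) => s.busy).length;
  const queued = samples.reduce((sum, s) => sum + s.queueLength, 0);
  const errors = samples.reduce((sum, s) => sum + s.errors, 0);
  return {
    utilizationPercent: (busy / samples.length) * 100, // Utilization
    avgQueueLength: queued / samples.length,           // Saturation
    errorCount: errors,                                // Errors
  };
}

// Illustrative data only: four polls of a worker.
console.log(useSummary([
  { busy: true, queueLength: 0, errors: 0 },
  { busy: true, queueLength: 2, errors: 1 },
  { busy: true, queueLength: 4, errors: 0 },
  { busy: false, queueLength: 2, errors: 1 },
]));
```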

Prometheus Setup

Installation (Kubernetes)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d

Key Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Recording Rules

groups:
  - name: api_metrics
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_requests_error_rate:percentage
        expr: (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))) * 100

Grafana Dashboards

Dashboard Design Principles

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

Essential Queries

# Request rate
sum(rate(http_requests_total[5m])) by (service)

# Error rate %
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
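
To make the P95 query less of a black box, here is a simplified sketch of the kind of estimate `histogram_quantile` produces: locate the bucket containing the target rank, then interpolate linearly inside it. The function name `estimateQuantile` and the bucket values are invented for illustration, and edge cases that Prometheus handles (such as `q` outside 0..1 or non-monotonic buckets) are left out.

```javascript
// Simplified sketch of a histogram_quantile-style estimate (assumes 0 < q <= 1).
// `buckets` is sorted by upper bound `le`, holds cumulative counts,
// and ends with a +Inf bucket, as Prometheus histograms do.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: fall back to the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly across the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Illustrative bucket counts for http_request_duration_seconds.
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, latencyBuckets)); // lands between 0.5s and 1s
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.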

Distributed Tracing

OpenTelemetry Setup (Node.js)

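const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// Batch finished spans and export them to Jaeger using the exporter's defaults.
// Note: addSpanProcessor() is the SDK 1.x API; newer SDK versions take
// span processors through the provider constructor instead.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new JaegerExporter()));
provider.register();

// Auto-instrument incoming and outgoing HTTP calls.
registerInstrumentations({
  instrumentations: [new HttpInstrumentation()],
});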

Context Propagation

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
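
The header above follows the W3C Trace Context layout `version-traceid-parentid-flags`. A minimal parsing sketch, assuming the version `00` layout only; the function name `parseTraceparent` is hypothetical, and in practice an OpenTelemetry propagator does this for you.

```javascript
// Parse a W3C Trace Context `traceparent` header (version 00 layout).
// Returns null when the header is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,   // shared by every span in the request
    parentId,  // span ID of the caller
    sampled: (parseInt(flags, 16) & 0x01) === 1, // lowest flag bit = sampled
  };
}

console.log(parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'));
```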

Jaeger Deployment

# Kubernetes
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch

SLIs/SLOs

Defining SLOs

slos:
  - name: api_availability
    target: 99.9%  # about 40.3 min downtime per 28-day window
    window: 28d
    sli: sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99%    # 99% of requests complete in < 500ms
    window: 28d
    sli: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

Error Budget

Error Budget = 1 - SLO Target
Example: 99.9% SLO → 0.1% error budget → 43.2 min per 30-day month
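
The arithmetic behind that example can be sketched as follows. The function name `errorBudget` is hypothetical, and the result depends on the window length you pick, so a 28-day window yields a smaller allowance than a 30-day month.

```javascript
// Convert an SLO target (in percent) into an error budget for a given window.
function errorBudget(sloTargetPercent, windowDays) {
  const budgetFraction = 1 - sloTargetPercent / 100; // Error Budget = 1 - SLO Target
  const windowMinutes = windowDays * 24 * 60;
  return {
    budgetFraction,
    allowedDowntimeMinutes: budgetFraction * windowMinutes,
  };
}

// 99.9% over 30 days: 0.001 * 43,200 minutes, about 43.2 minutes.
console.log(errorBudget(99.9, 30).allowedDowntimeMinutes.toFixed(2));
// 99.9% over 28 days: 0.001 * 40,320 minutes, about 40.3 minutes.
console.log(errorBudget(99.9, 28).allowedDowntimeMinutes.toFixed(2));
```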

Burn Rate Alerts

rules:
  - alert: SLOErrorBudgetBurnFast
    expr: slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Fast error budget burn - consuming 2% budget in 1 hour"
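
The 14.4 threshold and the "2% in 1 hour" summary are linked by simple arithmetic, sketched below. Burn rate is the observed error ratio divided by the error budget, and the share of budget consumed is the burn rate times the fraction of the SLO window that has elapsed. The function names are hypothetical; the 2% figure assumes a 30-day window, and with a 28-day window the same burn rate consumes about 2.1%.

```javascript
// Burn rate: how many times faster than "exactly on budget" errors are occurring.
function burnRate(errorRatio, sloTargetPercent) {
  const budgetFraction = 1 - sloTargetPercent / 100;
  return errorRatio / budgetFraction;
}

// Fraction of the total error budget consumed while burning at `rate`
// for `alertWindowHours` out of an SLO window of `sloWindowDays`.
function budgetConsumed(rate, alertWindowHours, sloWindowDays) {
  return rate * (alertWindowHours / (sloWindowDays * 24));
}

// A 1.44% error ratio against a 99.9% SLO burns at roughly 14.4x.
console.log(burnRate(0.0144, 99.9).toFixed(1));
// One hour at 14.4x uses about 2% of a 30-day budget.
console.log((budgetConsumed(14.4, 1, 30) * 100).toFixed(1) + '%');
```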

Alert Fatigue Prevention

Multi-Window Alerting

# Combine short + long windows to reduce false positives
- alert: HighLatency
  expr: |
    (job:http_request_duration:p95_5m > 1 and job:http_request_duration:p95_1h > 0.8)
  for: 5m
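
The logic of that rule can be sketched in a few lines: the alert condition holds only when both the short and the long window are over their thresholds, so a brief spike that has not moved the 1-hour figure stays quiet. The function name `shouldAlert` is hypothetical, the default thresholds mirror the rule above, and the `for: 5m` hold period is not modeled.

```javascript
// Condition holds only when BOTH windows exceed their thresholds (seconds).
function shouldAlert(p95Short, p95Long, shortThreshold = 1, longThreshold = 0.8) {
  return p95Short > shortThreshold && p95Long > longThreshold;
}

console.log(shouldAlert(1.5, 0.5)); // short spike only
console.log(shouldAlert(1.5, 0.9)); // sustained degradation
```

Under these assumptions the first call returns `false` and the second returns `true`.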

Severity Levels

Severity   Response             Examples
critical   Page immediately     Service down, data loss
warning    Review within hours  Degraded performance
info       Daily review         Capacity planning

Best Practices

Start with RED/USE methods for consistent metrics
Use recording rules for expensive queries
Implement multi-window alerts to reduce noise
Set achievable SLOs (don't aim for 100%)
Track error budget consistently
Correlate traces with metrics using trace IDs
Sample traces appropriately (1-10% in production)
Add context to spans (user_id, request_id)
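
For the sampling guideline above, one common approach is to decide from the trace ID itself, so that every service makes the same keep-or-drop decision for a given trace. The sketch below is a simplified illustration of that idea, not the OpenTelemetry SDK's own sampler; `shouldSample` is a hypothetical name and the trace IDs are examples.

```javascript
// Deterministic ratio sampling: map the first 32 bits of the trace ID
// onto 0 .. 2^32 - 1 and keep the trace if it falls below ratio * 2^32.
function shouldSample(traceId, ratio) {
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

const exampleTraceId = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSample(exampleTraceId, 0.05)); // 5% sampling
console.log(shouldSample(exampleTraceId, 0.01)); // 1% sampling
```

This relies on trace IDs being uniformly random; the same ID always produces the same decision at a given ratio.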

Related Skills

devops - Infrastructure provisioning
