Observability Engineer - Full-Stack Monitoring Expert: Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli list
Skill Instructions
description: Full-stack observability architect for Prometheus, Grafana, OpenTelemetry, distributed tracing (Jaeger/Tempo), SLIs/SLOs, error budgets, and alerting. Use for metrics, dashboards, traces, or reliability engineering.
allowed-tools: Read, Write, Edit, Bash
model: opus
context: fork
Observability Engineer - Full-Stack Monitoring Expert
⚠️ Chunking Rule
Large monitoring stacks (Prometheus + Grafana + OpenTelemetry + logs) = 1000+ lines. Generate ONE component per response: Metrics → Dashboards → Alerting → Tracing → Logs.
Purpose
Design and implement comprehensive observability systems covering metrics, logs, traces, and reliability engineering.
When to Use
- Set up Prometheus monitoring
- Create Grafana dashboards
- Implement distributed tracing (Jaeger, Tempo)
- Define SLIs/SLOs and error budgets
- Configure alerting systems
- Prevent alert fatigue
- Debug microservices latency
Scope Boundaries
This skill covers OBSERVABILITY STRATEGY: SLIs/SLOs, error budgets, dashboards, alerting design.
- For OpenTelemetry instrumentation details → use /sw-infra:opentelemetry
Core Concepts
Three Pillars of Observability
┌─────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│     METRICS     │      LOGS       │         TRACES          │
├─────────────────┼─────────────────┼─────────────────────────┤
│ Prometheus      │ Loki/ELK        │ Jaeger/Tempo            │
│ What happened?  │ Why it happened?│ How do requests flow?   │
│ Aggregated data │ Event details   │ Request journey         │
└─────────────────┴─────────────────┴─────────────────────────┘
RED Method (Services)
- Rate - Requests per second
- Errors - Error rate percentage
- Duration - Latency/response time
USE Method (Resources)
- Utilization - % time resource is busy
- Saturation - Queue length/wait time
- Errors - Error count
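As an illustration, the USE signals for a host map onto standard node_exporter metrics. These queries are a sketch assuming node_exporter's default metric names:

```promql
# Utilization: fraction of time each instance's CPUs were busy
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average relative to CPU count
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])
```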
Prometheus Setup
Installation (Kubernetes)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d
Key Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
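The relabel rule above only keeps pods that opt in via an annotation. A matching pod template might look like this; note that the prometheus.io/* annotations are a widely used convention, not a Prometheus built-in, and the port/path annotations only take effect if corresponding relabel rules exist:

```yaml
# Pod template metadata that the relabel rule above would keep
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```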
Recording Rules
groups:
  - name: api_metrics
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))) * 100
Grafana Dashboards
Dashboard Design Principles
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers)      │
├─────────────────────────────────────┤
│ Key Trends (Time Series)            │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps)  │
└─────────────────────────────────────┘
Essential Queries
# Request rate
sum(rate(http_requests_total[5m])) by (service)
# Error rate %
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Distributed Tracing
OpenTelemetry Setup (Node.js)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new JaegerExporter()));
provider.register();

registerInstrumentations({
  instrumentations: [new HttpInstrumentation()],
});
Context Propagation
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
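The traceparent header above follows the W3C Trace Context format: version, 16-byte trace ID, 8-byte parent span ID, and trace flags, all hex-encoded and dash-separated. A minimal parser (a sketch, not part of any OpenTelemetry package) makes the fields explicit:

```javascript
// Parse a W3C traceparent header: version-traceid-parentid-flags
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // bit 0 = sampled flag
  };
}

const ctx = parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
```

Here `ctx.sampled` is true because the flags byte is 01, so downstream services should also record this trace.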
Jaeger Deployment
# Kubernetes
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
SLIs/SLOs
Defining SLOs
slos:
  - name: api_availability
    target: 99.9%   # 43.2 min downtime per 30-day month
    window: 28d
    sli: sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99%     # 99% of requests complete in < 500ms
    window: 28d
    sli: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))
Error Budget
Error Budget = 1 - SLO Target
Example: 99.9% SLO → 0.1% error budget → 43.2 min/month
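The arithmetic above can be sketched as a tiny helper (hypothetical, not from any library):

```javascript
// Error budget in minutes for an SLO target over a window of N days.
function errorBudgetMinutes(sloTarget, windowDays) {
  const totalMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * totalMinutes;
}

errorBudgetMinutes(0.999, 30); // 43.2 minutes per 30-day month
errorBudgetMinutes(0.999, 28); // ~40.3 minutes over a 28-day window
```

Note the 43.2 min figure assumes a 30-day month; with the 28-day SLO window used above, the budget is slightly smaller.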
Burn Rate Alerts
rules:
  - alert: SLOErrorBudgetBurnFast
    expr: slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Fast error budget burn - consuming 2% of the 30-day budget in 1 hour"
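The 14.4 threshold comes from the multi-window burn-rate formula popularized by the Google SRE workbook: burn rate = (fraction of budget consumed) × (SLO period) / (alert window). A quick sketch of the calculation (hypothetical helper, not a library function):

```javascript
// Burn rate at which an alert window consumes the given fraction of the budget.
function burnRateThreshold(budgetFraction, sloPeriodHours, alertWindowHours) {
  return (budgetFraction * sloPeriodHours) / alertWindowHours;
}

burnRateThreshold(0.02, 30 * 24, 1); // 14.4 — the fast-burn threshold above
burnRateThreshold(0.05, 30 * 24, 6); // 6   — a typical slower-burn threshold
```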
Alert Fatigue Prevention
Multi-Window Alerting
# Combine short + long windows to reduce false positives (PromQL uses lowercase 'and')
- alert: HighLatency
  expr: |
    (job:http_request_duration:p95_5m > 1 and job:http_request_duration:p95_1h > 0.8)
  for: 5m
Severity Levels
| Severity | Response | Examples |
|---|---|---|
| critical | Page immediately | Service down, data loss |
| warning | Review in hours | Degraded performance |
| info | Daily review | Capacity planning |
Best Practices
- Start with RED/USE methods for consistent metrics
- Use recording rules for expensive queries
- Implement multi-window alerts to reduce noise
- Set achievable SLOs (don't aim for 100%)
- Track error budget consistently
- Correlate traces with metrics using trace IDs
- Sample traces appropriately (1-10% in production)
- Add context to spans (user_id, request_id)
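Sampling should be deterministic on the trace ID so every service in a request's path makes the same keep/drop decision. The sketch below is simplified and only similar in spirit to OpenTelemetry's TraceIdRatioBasedSampler, not its actual algorithm:

```javascript
// Deterministic head sampling: same trace ID → same decision everywhere.
function shouldSample(traceId, ratio) {
  // Interpret the last 8 hex chars of the trace ID as a 32-bit value
  const slice = parseInt(traceId.slice(-8), 16);
  return slice < ratio * 0x100000000; // keep if below ratio * 2^32
}

shouldSample('0af7651916cd43dd8448eb211c80319c', 0.1); // decision for a 10% ratio
```

Because the decision is a pure function of the trace ID, no coordination between services is needed.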
Related Skills
devops - Infrastructure provisioning
