jeremylongshore

apollo-incident-runbook

@jeremylongshore/apollo-incident-runbook
Updated 1/18/2026

Apollo.io incident response procedures. Use when handling Apollo outages, debugging production issues, or responding to integration failures. Trigger with phrases like "apollo incident", "apollo outage", "apollo down", "apollo production issue", "apollo emergency".

Installation

$skills install @jeremylongshore/apollo-incident-runbook
Claude Code
Cursor
Copilot
Codex
Antigravity

Details

Path: plugins/saas-packs/apollo-pack/skills/apollo-incident-runbook/SKILL.md
Branch: main
Scoped Name: @jeremylongshore/apollo-incident-runbook

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions


name: apollo-incident-runbook
description: |
  Apollo.io incident response procedures. Use when handling Apollo outages, debugging production issues, or responding to integration failures. Trigger with phrases like "apollo incident", "apollo outage", "apollo down", "apollo production issue", "apollo emergency".
allowed-tools: Read, Write, Edit, Bash(kubectl:), Bash(curl:)
version: 1.0.0
license: MIT
author: Jeremy Longshore jeremy@intentsolutions.io

Apollo Incident Runbook

Overview

Structured incident response procedures for Apollo.io integration issues with diagnosis steps, mitigation actions, and recovery procedures.

Incident Classification

| Severity | Impact | Response Time | Examples |
| --- | --- | --- | --- |
| P1 - Critical | Complete outage | 15 min | API down, auth failed |
| P2 - Major | Degraded service | 1 hour | High error rate, slow responses |
| P3 - Minor | Limited impact | 4 hours | Cache issues, minor errors |
| P4 - Low | No user impact | Next day | Log warnings, cosmetic issues |
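
The response-time targets in the table can be encoded so that tooling flags a missed response. The following is a minimal TypeScript sketch, not part of the skill itself: `RESPONSE_TARGET_MINUTES`, `responseDeadline`, and `isResponseOverdue` are hypothetical names, and reading "Next day" as 24 hours is an assumption.

```typescript
// Hypothetical helper that encodes the classification table above.
type Severity = "P1" | "P2" | "P3" | "P4";

// Response-time targets in minutes. The P4 value is an assumption:
// the table says "Next day", which is read here as 24 hours.
const RESPONSE_TARGET_MINUTES: Record<Severity, number> = {
  P1: 15,
  P2: 60,
  P3: 240,
  P4: 1440,
};

// Latest time by which someone should have responded.
function responseDeadline(severity: Severity, detectedAt: Date): Date {
  return new Date(detectedAt.getTime() + RESPONSE_TARGET_MINUTES[severity] * 60000);
}

// True once the response target for this severity has been missed.
function isResponseOverdue(severity: Severity, detectedAt: Date, now: Date): boolean {
  return now.getTime() > responseDeadline(severity, detectedAt).getTime();
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to exercise in a test or replay against a past incident timeline.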

Quick Diagnosis Commands

# Check Apollo status
curl -s https://status.apollo.io/api/v2/status.json | jq '.status'

# Verify API key
curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY" | jq

# Check rate limit status
# (-I sends HEAD and cannot be combined with -d; dump the POST response headers instead)
curl -s -o /dev/null -D - "https://api.apollo.io/v1/people/search" \
  -H "Content-Type: application/json" \
  -d '{"api_key": "'"$APOLLO_API_KEY"'", "per_page": 1}' \
  | grep -i "ratelimit"

# Check application health
curl -s http://localhost:3000/health/apollo | jq

# Check error logs
kubectl logs -l app=apollo-service --tail=100 | grep -i error

# Check metrics
curl -s http://localhost:3000/metrics | grep apollo_

Incident Response Procedures

P1: Complete API Failure

Symptoms:

  • All Apollo requests returning 5xx errors
  • Health check endpoint failing
  • Alerts firing on error rate

Immediate Actions (0-15 min):

# 1. Confirm Apollo is down (not just us)
curl -s https://status.apollo.io/api/v2/status.json | jq

# 2. Enable circuit breaker / fallback mode
kubectl set env deployment/apollo-service APOLLO_FALLBACK_MODE=true

# 3. Notify stakeholders
# Post to #incidents Slack channel

# 4. Check if it's our API key
curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY"
# If 401: Key is invalid - check if rotated

Fallback Mode Implementation:

// src/lib/apollo/circuit-breaker.ts
class CircuitBreaker {
  private failures = 0;
  private lastFailure: Date | null = null;
  private isOpen = false;

  // The fallback may be synchronous or async (getFallbackContacts below is async).
  async execute<T>(fn: () => Promise<T>, fallback: () => T | Promise<T>): Promise<T> {
    if (this.isOpen) {
      if (this.shouldAttemptReset()) {
        this.isOpen = false;
      } else {
        console.warn('Circuit breaker open, using fallback');
        return fallback();
      }
    }

    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      this.lastFailure = new Date();

      if (this.failures >= 5) {
        this.isOpen = true;
        console.error('Circuit breaker opened after 5 failures');
      }

      return fallback();
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailure) return true;
    const elapsed = Date.now() - this.lastFailure.getTime();
    return elapsed > 60000; // Try again after 1 minute
  }
}

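// Fallback data source
// Note: apolloCache is assumed to be the application's own cache client; it is not defined in this snippet.
async function getFallbackContacts(criteria: any) {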
  // Return cached data
  const cached = await apolloCache.search(criteria);
  if (cached.length > 0) return cached;

  // Return empty with warning
  console.warn('No fallback data available');
  return [];
}
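
The runbook toggles `APOLLO_FALLBACK_MODE` with `kubectl set env`, but the code above does not show how the service reads that flag. One possible wiring is sketched below. `isFallbackMode` and `withFallback` are hypothetical names, and treating only the string `true` (case-insensitive) as enabled is an assumption.

```typescript
// Hypothetical wiring for APOLLO_FALLBACK_MODE. The environment is passed in
// as a plain object so the logic can be exercised without a real process env.
type EnvVars = Record<string, string | undefined>;

// Assumption: only the value "true" (any letter case) enables fallback mode.
function isFallbackMode(env: EnvVars): boolean {
  return (env["APOLLO_FALLBACK_MODE"] ?? "").toLowerCase() === "true";
}

// Skips the live call entirely when fallback mode is on; otherwise tries the
// live call and uses the fallback if it throws.
async function withFallback<T>(
  env: EnvVars,
  live: () => Promise<T>,
  fallback: () => Promise<T>,
): Promise<T> {
  if (isFallbackMode(env)) return fallback();
  try {
    return await live();
  } catch {
    return fallback();
  }
}
```

In a Node.js service the first argument would typically be `process.env`, with `live` wrapping the Apollo call and `fallback` wrapping something like `getFallbackContacts`.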

Recovery Steps:

# 1. Monitor Apollo status page for resolution
watch -n 30 'curl -s https://status.apollo.io/api/v2/status.json | jq'

# 2. When Apollo is back, disable fallback mode
kubectl set env deployment/apollo-service APOLLO_FALLBACK_MODE=false

# 3. Verify connectivity
curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY"

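# 4. Check for request backlog
# (with a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set)
kubectl logs -l app=apollo-service --tail=-1 | grep -c "queued"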

# 5. Gradually restore traffic
kubectl scale deployment/apollo-service --replicas=1
# Wait, verify healthy
kubectl scale deployment/apollo-service --replicas=3

P1: API Key Compromised

Symptoms:

  • Unexpected 401 errors
  • Unusual usage patterns
  • Alert from Apollo about suspicious activity

Immediate Actions:

# 1. Rotate API key immediately in Apollo dashboard
# Settings > Integrations > API > Regenerate Key

# 2. Update secret in production
# Kubernetes
kubectl create secret generic apollo-secrets \
  --from-literal=api-key=NEW_KEY \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Restart deployments to pick up new key
kubectl rollout restart deployment/apollo-service

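# 4. Audit usage logs
# (--tail=-1 is needed because a label selector otherwise limits output to 10 lines per pod)
kubectl logs -l app=apollo-service --since=24h --tail=-1 | grep "apollo_request"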

Post-Incident:

  • Review access controls
  • Enable IP allowlisting if available
  • Implement key rotation schedule
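
A key rotation schedule can start as a simple age check run from a scheduled job. The sketch below is illustrative only: `keyAgeDays` and `isRotationDue` are hypothetical names, and the 90-day default is an assumed policy, not an Apollo requirement.

```typescript
// Hypothetical rotation check; the 90-day default is an assumed policy.
const DAY_MS = 24 * 60 * 60 * 1000;

// Whole days elapsed since the key was created.
function keyAgeDays(createdAt: Date, now: Date): number {
  return Math.floor((now.getTime() - createdAt.getTime()) / DAY_MS);
}

// True when the key has reached or passed the maximum allowed age.
function isRotationDue(createdAt: Date, now: Date, maxAgeDays = 90): boolean {
  return keyAgeDays(createdAt, now) >= maxAgeDays;
}
```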

P2: High Error Rate

Symptoms:

  • Error rate > 5%
  • Mix of successful and failed requests
  • Alerts on apollo_errors_total

Diagnosis:

# Check error distribution
curl -s http://localhost:3000/metrics | grep apollo_errors_total

# Sample recent errors
kubectl logs -l app=apollo-service --tail=500 | grep -A2 "apollo_error"

# Check if specific endpoint is failing
curl -s http://localhost:3000/metrics | grep apollo_requests_total | sort
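
To compare the counters against the 5% threshold from the symptoms, the metrics output can be reduced to a single error rate. This TypeScript sketch parses Prometheus text exposition format. `sumMetric` and `errorRate` are hypothetical helpers, and the calculation assumes `apollo_requests_total` counts every request, including failed ones.

```typescript
// Sums every sample of one metric in Prometheus text exposition output.
// Matches lines such as `apollo_errors_total{type="timeout"} 12` or
// `apollo_errors_total 12`; comment lines (# HELP, # TYPE) are skipped.
function sumMetric(metricsText: string, name: string): number {
  let total = 0;
  for (const raw of metricsText.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const hasLabels = line.startsWith(name + "{");
    if (!hasLabels && !line.startsWith(name + " ")) continue;
    // Everything after the label set (or after the bare name) is "value [timestamp]".
    const rest = hasLabels ? line.slice(line.lastIndexOf("}") + 1) : line.slice(name.length);
    const value = Number(rest.trim().split(/\s+/)[0]);
    if (Number.isFinite(value)) total += value;
  }
  return total;
}

// Fraction of requests that failed, or 0 when no requests were recorded.
// Assumption: apollo_requests_total includes failed requests.
function errorRate(metricsText: string): number {
  const errors = sumMetric(metricsText, "apollo_errors_total");
  const requests = sumMetric(metricsText, "apollo_requests_total");
  return requests === 0 ? 0 : errors / requests;
}
```

Because counters are cumulative since process start, an alerting rule would normally compare the difference between two scrapes rather than the raw totals.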

Common Causes & Fixes:

| Error Type | Likely Cause | Fix |
| --- | --- | --- |
| validation_error | Bad request format | Check request payload |
| rate_limit | Too many requests | Enable backoff, reduce concurrency |
| auth_error | Key issue | Verify API key |
| timeout | Network/Apollo slow | Increase timeout, add retry |
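
For the error types in this table to show up as metric labels, the service needs to bucket each failed call. A possible classifier is sketched here. `classifyError` and `SUGGESTED_FIX` are hypothetical, and the status-code mapping is an assumption based on common HTTP semantics rather than documented Apollo behavior.

```typescript
type ApolloErrorType = "validation_error" | "rate_limit" | "auth_error" | "timeout" | "unknown";

// Hypothetical mapping from an HTTP outcome to the error types in the table.
// httpStatus is null when no response was received at all.
function classifyError(httpStatus: number | null, timedOut: boolean): ApolloErrorType {
  if (timedOut || httpStatus === 408 || httpStatus === 504) return "timeout";
  if (httpStatus === 429) return "rate_limit";
  if (httpStatus === 401 || httpStatus === 403) return "auth_error";
  if (httpStatus === 400 || httpStatus === 422) return "validation_error";
  return "unknown";
}

// Fixes copied from the table; the "unknown" entry is an added placeholder.
const SUGGESTED_FIX: Record<ApolloErrorType, string> = {
  validation_error: "Check request payload",
  rate_limit: "Enable backoff, reduce concurrency",
  auth_error: "Verify API key",
  timeout: "Increase timeout, add retry",
  unknown: "Inspect logs",
};
```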

Mitigation:

# Reduce request rate
kubectl set env deployment/apollo-service APOLLO_RATE_LIMIT=50

# Enable aggressive caching
kubectl set env deployment/apollo-service APOLLO_CACHE_TTL=3600

# Scale down to reduce load
kubectl scale deployment/apollo-service --replicas=1

P2: Rate Limit Exceeded

Symptoms:

  • 429 responses
  • apollo_rate_limit_hits_total increasing
  • Requests queuing

Immediate Actions:

# 1. Check current rate limit status (print response headers only, without the progress meter)
curl -s -o /dev/null -D - "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY" \
  | grep -i ratelimit

# 2. Pause non-essential operations
kubectl set env deployment/apollo-service \
  APOLLO_PAUSE_BACKGROUND_JOBS=true

# 3. Reduce concurrency
kubectl set env deployment/apollo-service \
  APOLLO_MAX_CONCURRENT=2

# 4. Wait for rate limit to reset (typically 1 minute)
sleep 60

# 5. Gradually resume
kubectl set env deployment/apollo-service \
  APOLLO_MAX_CONCURRENT=5 \
  APOLLO_PAUSE_BACKGROUND_JOBS=false

Prevention:

// Implement request budgeting
class RequestBudget {
  private used = 0;
  private resetTime: Date;

  constructor(private limit: number = 90) {
    this.resetTime = this.getNextMinute();
  }

  async acquire(): Promise<boolean> {
    if (new Date() > this.resetTime) {
      this.used = 0;
      this.resetTime = this.getNextMinute();
    }

    if (this.used >= this.limit) {
      const waitMs = this.resetTime.getTime() - Date.now();
      console.warn(`Budget exhausted, waiting ${waitMs}ms`);
      await new Promise(r => setTimeout(r, waitMs));
      return this.acquire();
    }

    this.used++;
    return true;
  }

  private getNextMinute(): Date {
    const next = new Date();
    next.setSeconds(0, 0);
    next.setMinutes(next.getMinutes() + 1);
    return next;
  }
}
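
The budget above limits how many requests are sent; the "enable backoff" fix from the error table covers what to do after a 429 still gets through. The sketch below computes a retry delay using exponential backoff with full jitter. `backoffDelayMs` is a hypothetical helper, and whether Apollo sends a `Retry-After` header is an assumption, so the header is treated as optional.

```typescript
// Delay before retry number `attempt` (0-based), in milliseconds.
// If a Retry-After header with a value in seconds is present, it wins.
// Otherwise: exponential backoff with full jitter, capped at maxMs.
// HTTP-date values of Retry-After are not parsed and fall through to backoff.
function backoffDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  baseMs = 1000,
  maxMs = 60000,
  random: () => number = Math.random,
): number {
  const header = retryAfterHeader === null ? "" : retryAfterHeader.trim();
  const retryAfterSec = header === "" ? NaN : Number(header);
  if (Number.isFinite(retryAfterSec) && retryAfterSec >= 0) {
    return retryAfterSec * 1000;
  }
  // The ceiling doubles with every attempt until it reaches maxMs.
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  // Full jitter: pick uniformly in [0, ceiling) to spread out retries.
  return Math.floor(random() * ceiling);
}
```

The `random` parameter exists so the jitter can be pinned in tests; production callers can leave it at the default.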

P3: Slow Responses

Symptoms:

  • P95 latency > 5 seconds
  • Timeouts occurring
  • User complaints about slow search

Diagnosis:

# Check latency metrics
curl -s http://localhost:3000/metrics \
  | grep apollo_request_duration

# Check Apollo's response time
time curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY"

# Check pod CPU and memory usage (resource pressure on our side can add latency)
kubectl top pods -l app=apollo-service
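
The P95 symptom can be checked directly from a list of sampled request durations when histogram metrics are not available. This is a small sketch using the nearest-rank method. `percentile` and `isSlow` are hypothetical helpers, and the 5000 ms default mirrors the threshold in the symptoms list.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Multiply before dividing to keep the rank computation in integers.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// True when P95 latency exceeds the threshold (default 5 seconds).
function isSlow(samplesMs: number[], thresholdMs = 5000): boolean {
  return percentile(samplesMs, 95) > thresholdMs;
}
```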

Mitigation:

# Increase timeout
kubectl set env deployment/apollo-service APOLLO_TIMEOUT=60000

# Enable request hedging (send duplicate requests)
kubectl set env deployment/apollo-service APOLLO_HEDGE_REQUESTS=true

# Reduce payload size (request fewer results)
kubectl set env deployment/apollo-service APOLLO_DEFAULT_PER_PAGE=25

Post-Incident Template

## Incident Report: [Title]

**Date:** [Date]
**Duration:** [Start] - [End] ([X] minutes)
**Severity:** P[1-4]
**Affected Systems:** Apollo integration

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Service restored

### Root Cause
[Description of what caused the incident]

### Impact
- [Number] of failed requests
- [Number] of affected users
- [Duration] of degraded service

### Resolution
[What was done to fix the issue]

### Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
- [ ] [Monitoring improvement]

### Lessons Learned
[What we learned from this incident]

Output

  • Incident classification matrix
  • Quick diagnosis commands
  • Response procedures by severity
  • Circuit breaker implementation
  • Post-incident template

Error Handling

| Issue | Escalation |
| --- | --- |
| P1 > 30 min | Page on-call lead |
| P2 > 2 hours | Notify management |
| Recurring P3 | Create P2 tracking |
| Apollo outage | Open support ticket |
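
The escalation rules can also be expressed as a function, which is convenient for a bot that watches open incidents. The sketch below is an illustration only: `escalationActions` is a hypothetical name, and the `recurring` and `apolloOutage` flags are passed in by the caller because detecting them depends on the deployment.

```typescript
type IncidentSeverity = "P1" | "P2" | "P3" | "P4";

// Returns every escalation from the table that applies to this incident.
// An empty array means no escalation is needed yet.
function escalationActions(
  severity: IncidentSeverity,
  elapsedMinutes: number,
  opts: { recurring?: boolean; apolloOutage?: boolean } = {},
): string[] {
  const actions: string[] = [];
  if (severity === "P1" && elapsedMinutes > 30) actions.push("Page on-call lead");
  if (severity === "P2" && elapsedMinutes > 120) actions.push("Notify management");
  if (severity === "P3" && opts.recurring) actions.push("Create P2 tracking");
  if (opts.apolloOutage) actions.push("Open support ticket");
  return actions;
}
```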

Next Steps

Proceed to apollo-data-handling for data management.
