harbor

@benchflow-ai/harbor

benchflow-ai

230

165 forks

Updated 1/18/2026


Harbor framework for agent evaluation. Use when: (1) Running harbor commands (harbor run, harbor tasks check), (2) Creating/validating SkillsBench tasks, (3) Understanding task format or debugging failures.

Installation

$ skills install @benchflow-ai/harbor

Details

Repository: benchflow-ai/skillsbench

Path: .claude/skills/harbor/SKILL.md

Branch: main

Scoped Name: @benchflow-ai/harbor

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

skills list

Skill Instructions

name: harbor
description: "Harbor framework for agent evaluation. Use when: (1) Running harbor commands (harbor run, harbor tasks check), (2) Creating/validating SkillsBench tasks, (3) Understanding task format or debugging failures."

Harbor

Agent evaluation framework from the creators of Terminal-Bench.

Official Documentation

Docs: https://harborframework.com/docs
Getting Started: https://harborframework.com/docs/getting-started
GitHub: https://github.com/laude-institute/harbor

Local Workspace & API Keys

.local-workspace/ - Git-ignored directory for cloning PRs, temporary files, external repos, etc.
.local-workspace/.env - May contain ANTHROPIC_API_KEY and other API credentials. Check and use when running harbor with API access.
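
One way to make those credentials visible to a harbor run launched from a script is to load the file into the process environment first. The sketch below assumes the file uses plain KEY=VALUE lines (optionally prefixed with export and optionally quoted); the source does not specify the file's format, and load_env_file is a hypothetical helper, not part of Harbor.

```python
import os
from pathlib import Path


def load_env_file(path=".local-workspace/.env", environ=None):
    """Copy KEY=VALUE pairs from an env file into environ (hypothetical helper).

    Assumes simple KEY=VALUE lines; keys that are already set are left alone.
    Returns the list of keys that were added.
    """
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    if not env_path.is_file():
        return []
    loaded = []
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        if key and key not in environ:
            environ[key] = value
            loaded.append(key)
    return loaded
```

Calling load_env_file() before spawning harbor with subprocess leaves any variables already exported in the shell untouched, so an explicitly set ANTHROPIC_API_KEY takes precedence over the file.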

Quick Reference

# Install
uv tool install harbor

# Validate task
harbor tasks check tasks/<task-id>

# Run oracle (must pass 100%)
harbor run -p tasks/<task-id> -a oracle

# Run with agent (specify model with -m)
harbor run -p tasks/<task-id> -a claude-code -m 'anthropic/claude-opus-4-5'

# List datasets
harbor datasets list

# Cloud execution (parallel)
harbor run -d "<dataset@version>" -a "<agent>" -m "<model>" --env "daytona" -n 32

SkillsBench Task Structure

tasks/<task-id>/
  task.toml           # Metadata
  instruction.md      # Agent instructions
  environment/
    Dockerfile        # Container + COPY skills to all agent locations
    skills/           # Skills for agents
  tests/
    test.sh           # Runs pytest, writes reward.txt
    test_outputs.py   # Test cases
  solution/
    solve.sh          # Oracle solution (human-written)
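
Before running harbor tasks check, a quick local pre-check can confirm that a task directory contains every path in the layout above. This is a minimal sketch based only on that tree; missing_task_paths is a hypothetical helper and is not a substitute for Harbor's own validation.

```python
from pathlib import Path

# Paths taken from the task layout above, relative to tasks/<task-id>/.
REQUIRED_FILES = [
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
    "tests/test_outputs.py",
    "solution/solve.sh",
]
REQUIRED_DIRS = ["environment/skills"]


def missing_task_paths(task_dir):
    """Return the expected paths that are absent from task_dir (hypothetical helper)."""
    root = Path(task_dir)
    missing = [p for p in REQUIRED_FILES if not (root / p).is_file()]
    missing += [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    return missing
```

An empty list means the skeleton is complete; it says nothing about whether the file contents are valid, which is what harbor tasks check and the oracle run are for.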

Results Location

jobs/<timestamp>/<task-id>/:

trial.log - Execution log
verifier/reward.txt - 0 (fail) or 1 (pass)
verifier/ctrf.json - Test details
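
To collect pass/fail results across a job, the reward files can be read directly. The sketch below assumes the jobs/<timestamp>/<task-id>/verifier/reward.txt layout listed above and that reward.txt holds a single number (0 or 1); read_reward and summarize_job are hypothetical helpers, not Harbor APIs.

```python
from pathlib import Path


def read_reward(trial_dir):
    """Return True for reward 1, False for reward 0, None if missing or unreadable."""
    reward_file = Path(trial_dir) / "verifier" / "reward.txt"
    if not reward_file.is_file():
        return None
    try:
        # Assumption: the file holds one number; float() accepts "1", "1.0", and a trailing newline.
        return float(reward_file.read_text().strip()) == 1.0
    except ValueError:
        return None


def summarize_job(job_dir):
    """Map each <task-id> subdirectory of jobs/<timestamp>/ to its pass/fail result."""
    return {
        trial.name: read_reward(trial)
        for trial in sorted(Path(job_dir).iterdir())
        if trial.is_dir()
    }
```

A None entry usually means the verifier never wrote a reward, in which case trial.log is the next place to look.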

For task format details, see references/task-format.md

Agent Skill Support

Each agent looks for skills in its own directory, so the task Dockerfile must copy them there explicitly. Place skills in environment/skills/ and add a COPY line for each of the following locations:

Supported by Harbor (benchmarkable)

Agent	Skills Directory	Docs
Claude Code	`.claude/skills/`	docs
Codex (OpenAI)	`.codex/skills/`	docs
OpenCode	`.opencode/skill/` or `.claude/skills/`	docs
Goose	`.goose/skills/` or `.claude/skills/`	docs
Factory	`.factory/skills/`	docs
Portable format	`.agents/skills/`	Used by Goose, Amp
GitHub Copilot	`.github/skills/`	docs

Not yet supported by Harbor

Agent	Skills Directory	Docs
Amp	`.agents/skills/` or `.claude/skills/`	docs
Letta	`.skills/`	docs

Adding Skills to Tasks

# Copy skills to ALL agent paths in Dockerfile
COPY skills /root/.claude/skills
COPY skills /root/.codex/skills
COPY skills /root/.opencode/skill
COPY skills /root/.goose/skills
COPY skills /root/.factory/skills
COPY skills /root/.agents/skills
COPY skills /root/.github/skills
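
Since it is easy to forget one of these destinations, a small check can flag a Dockerfile that omits any of them. This sketch assumes the exact COPY skills <dest> form and the /root home directory used above; missing_skill_copies is a hypothetical helper and does not handle multi-source or flag-carrying COPY instructions.

```python
# Agent skill directories from the COPY lines above, relative to the home directory.
AGENT_SKILL_DIRS = [
    ".claude/skills",
    ".codex/skills",
    ".opencode/skill",
    ".goose/skills",
    ".factory/skills",
    ".agents/skills",
    ".github/skills",
]


def missing_skill_copies(dockerfile_text, home="/root"):
    """Return agent skill destinations with no matching `COPY skills <dest>` line."""
    copied = set()
    for line in dockerfile_text.splitlines():
        parts = line.split()
        # Only the simple three-token form `COPY skills <dest>` is recognized.
        if len(parts) == 3 and parts[0].upper() == "COPY" and parts[1] == "skills":
            copied.add(parts[2].rstrip("/"))
    return [f"{home}/{d}" for d in AGENT_SKILL_DIRS if f"{home}/{d}" not in copied]
```

Passing the contents of environment/Dockerfile returns an empty list when all seven destinations are covered.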
