Agent Skills

evaluation-anchor-checker

@WILLOSCAR/evaluation-anchor-checker
WILLOSCAR · 421 · 29 forks · Updated 4/29/2026
View on GitHub

Audit and rewrite evaluation/numeric claims to ensure they carry minimal protocol context (task + metric + constraint) and avoid underspecified model naming. **Trigger**: evaluation anchor checker, numeric claim hygiene, underspecified numbers, protocol context, 评测锚点检查, 数字断言, 指标上下文. **Use when**: before final merge/polish, or when reviewers would likely flag claims as underspecified (numbers without task/metric/budget), or `pipeline-auditor` warns about suspicious model naming. **Skip if**: evidence is too thin to justify numeric claims (route upstream to C3/C4), or you are pre-C2 (NO PROSE). **Network**: none. **Guardrail**: do not invent numbers; do not add/remove/move citation keys; if protocol context is missing, weaken/remove the numeric claim rather than guessing.

Installation

$ npx agent-skills-cli install @WILLOSCAR/evaluation-anchor-checker

Works with: Claude Code · Cursor · Copilot · Codex · Antigravity

Details

Path: .codex/skills/evaluation-anchor-checker/SKILL.md
Branch: main
Scoped Name: @WILLOSCAR/evaluation-anchor-checker

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: evaluation-anchor-checker
description: |
  Audit and rewrite evaluation/numeric claims to ensure they carry minimal protocol context (task + metric + constraint) and avoid underspecified model naming.
  Trigger: evaluation anchor checker, numeric claim hygiene, underspecified numbers, protocol context, 评测锚点检查, 数字断言, 指标上下文.
  Use when: before final merge/polish, or when reviewers would likely flag claims as underspecified (numbers without task/metric/budget), or pipeline-auditor warns about suspicious model naming.
  Skip if: evidence is too thin to justify numeric claims (route upstream to C3/C4), or you are pre-C2 (NO PROSE).
  Network: none.
  Guardrail: do not invent numbers; do not add/remove/move citation keys; if protocol context is missing, weaken/remove the numeric claim rather than guessing.

Evaluation Anchor Checker (make numbers reviewer-safe)

Purpose: fix a reviewer-magnet failure mode in agent surveys:

  • strong numeric/performance statements appear
  • but the minimal evaluation context is missing

This skill treats numeric claims as contracts:

  • if a number stays, the same sentence must contain enough protocol context to interpret it
  • if that context is not in evidence, the claim must be downgraded (no guessing)

Inputs

Preferred (pre-merge, keeps anchoring intact):

  • the affected sections/*.md files

Optional context (read-only; helps you avoid guessing; a loading sketch follows the list):

  • outline/writer_context_packs.jsonl (look for evaluation_anchor_minimal, evaluation_protocol, anchor_facts)
  • outline/evidence_drafts.jsonl / outline/anchor_sheet.jsonl
  • citations/ref.bib
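As an illustration, the evaluation context could be pulled out of a pack line like this (a minimal sketch; the field names come from the list above, while the record layout around them is an assumption):

```python
import json
from pathlib import Path

# Minimal sketch: read outline/writer_context_packs.jsonl and keep only the
# evaluation-related fields. The record layout around these fields is assumed.
def load_eval_context(path: str = "outline/writer_context_packs.jsonl") -> list[dict]:
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            pack = json.loads(line)
            records.append({
                "evaluation_anchor_minimal": pack.get("evaluation_anchor_minimal"),
                "evaluation_protocol": pack.get("evaluation_protocol"),
                "anchor_facts": pack.get("anchor_facts"),
            })
    return records
```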

Outputs

  • Updated sections/*.md (or output/DRAFT.md if you are post-merge), with safer evaluation anchoring
  • output/EVAL_ANCHOR_REPORT.md (always; short report with files checked / changed / weakened sentences)
  • Optional completion marker: output/eval_anchors_checked.refined.ok

Recommended slot in the survey pipeline

Use this as the last section-level numeric hygiene sweep before merge:

  • after paragraph-curator + style-harmonizer + opener-variator
  • before transition-weaver / section-merger

Reason:

  • earlier section-level rewrite passes can legitimately rephrase or fuse numeric sentences
  • if you only wait for pipeline-auditor, numeric-context issues are discovered too late in the merged draft
  • section-scoped fixes are cheaper and preserve citation anchoring better than post-merge patching

Read Order

Always read:

  • references/numeric_hygiene.md

Machine-readable asset:

  • assets/numeric_hygiene.json

The asset defines the keyword families and qualitative fallback templates. Keep the script deterministic and let the policy live in the asset/reference pair.
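For orientation, loading the asset might look like this (a minimal sketch; `keyword_families` and `fallback_templates` are assumed key names for the two things the asset defines, not the actual schema):

```python
import json
from pathlib import Path

ASSET = Path(".codex/skills/evaluation-anchor-checker/assets/numeric_hygiene.json")

def load_policy(asset_path: Path = ASSET) -> dict:
    # Deterministic: no network, no randomness; the policy lives in the asset.
    policy = json.loads(asset_path.read_text(encoding="utf-8"))
    # Assumed shape (illustrative, not the actual schema):
    #   policy["keyword_families"]   -> task/metric/constraint keyword lists
    #   policy["fallback_templates"] -> qualitative rewrites ("often", "can", "may")
    return policy
```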

Role prompt: Reviewer-minded Editor (evaluation hygiene)

You are a reviewer-minded editor for evaluation claims in a technical survey.

Goal:
- make every numeric/performance claim interpretable and reviewer-safe

Hard constraints:
- do not invent numbers
- do not add/remove/move citation keys
- if protocol context is missing, weaken or remove the numeric claim

Minimum context to include when keeping a number:
- task / setting (what kind of task)
- metric (what is being measured)
- constraint (budget/cost/tool access/horizon/seed/logging) when relevant

Avoid:
- ambiguous model naming that looks hallucinated (e.g., “GPT-5”) unless the cited paper uses it verbatim

Workflow (explicit inputs)

  • Use outline/writer_context_packs.jsonl to locate the subsection's allowed citations and any extracted evaluation_protocol/anchor_facts.
  • Cross-check outline/evidence_drafts.jsonl and outline/anchor_sheet.jsonl for task/metric/constraint context before touching numbers.
  • Validate every cited key against citations/ref.bib (do not introduce new keys); see the sketch after this list.
  • Write output/EVAL_ANCHOR_REPORT.md so the pipeline has an auditable completion artifact for this sweep.
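For the citation guardrail in particular, one way to verify that no new keys were introduced is sketched below (assuming Pandoc-style `[@key]` citations and standard BibTeX entries; both are assumptions about this pipeline's formats):

```python
import re
from pathlib import Path

def bib_keys(bib_path: str = "citations/ref.bib") -> set[str]:
    # Collect every key defined in the .bib file, e.g. "@article{Key2024,".
    text = Path(bib_path).read_text(encoding="utf-8")
    return set(re.findall(r"@\w+\{([^,\s]+),", text))

def undefined_keys(sentence: str, known: set[str]) -> set[str]:
    # Pandoc-style citations like [@SomeBench]; multi-key groups are not handled.
    cited = set(re.findall(r"\[@([^\];\s]+)", sentence))
    return cited - known
```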

What to enforce (the “minimum protocol trio”)

When a sentence contains digits (a percentage, a multiplier such as “3x”, or a plain number), apply the following check (sketched in code after this list):

  • Keep the number only if you can attach at least 2 of the following in the same sentence without guessing:
    • task family / benchmark name
    • metric definition
    • constraint (budget, tool access, cost model, retries, horizon)

If you cannot, downgrade:

  • remove the number and rewrite as qualitative (“often”, “can”, “may”) with the same citation
  • or move the specificity into a verification target (“evaluations need to report …”) without adding new facts
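In code, the trio check might look like the following (a minimal sketch; the keyword lists are illustrative stand-ins for what `assets/numeric_hygiene.json` would supply):

```python
import re

# Illustrative stand-ins for the keyword families in assets/numeric_hygiene.json.
CONTEXT_FAMILIES = {
    "task": ("benchmark", "task", "suite", "dataset"),
    "metric": ("accuracy", "success rate", "pass@", "exact match", "f1"),
    "constraint": ("budget", "tool access", "retries", "horizon", "seed", "cost"),
}

def protocol_score(sentence: str) -> int:
    """Count how many of task/metric/constraint the sentence mentions (0-3)."""
    lower = sentence.lower()
    return sum(any(kw in lower for kw in kws) for kws in CONTEXT_FAMILIES.values())

def needs_downgrade(sentence: str) -> bool:
    """Flag numeric sentences carrying fewer than 2 of the 3 context kinds."""
    return bool(re.search(r"\d", sentence)) and protocol_score(sentence) < 2
```

For example, `needs_downgrade("Model X achieves 75% [@SomeBench].")` returns True because only the number is present; naming the benchmark and the metric in the same sentence brings the score to 2 and lets the number stay.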

Mini examples (paraphrase; do not copy)

Bad (underspecified):

  • Model X achieves 75% exact performance [@SomeBench].

Better (minimal context):

  • On <task/benchmark>, Model X reaches ~75% <metric>, under <constraint/budget/tool access> [@SomeBench].

Better (downgrade when context is missing):

  • Reported gains vary, but comparisons remain fragile when budgets and retry policies are not reported [@SomeBench].

Done checklist

  • output/EVAL_ANCHOR_REPORT.md exists and reports a non-zero file count.
  • No numeric claim remains without minimal protocol context.
  • No ambiguous model naming remains unless explicitly supported by citations.
  • Citation keys are unchanged.
  • If you removed/downgraded numbers, the paragraph still makes a defensible, evidence-bounded point.

Script

Quick Start

  • python .codex/skills/evaluation-anchor-checker/scripts/run.py --workspace workspaces/<ws>
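If a run has to emit the completion artifacts by hand (for example after manual patching), the shape could be as below (a minimal sketch; the paths come from the Outputs section above, the report wording is illustrative):

```python
from pathlib import Path

def write_report(checked: list[str], changed: list[str], weakened: list[str]) -> None:
    # Write the audit report plus the optional completion marker (see Outputs).
    out = Path("output")
    out.mkdir(exist_ok=True)
    body = "\n".join([
        "# EVAL_ANCHOR_REPORT",
        f"- files checked: {len(checked)}",
        f"- files changed: {len(changed)}",
        "- weakened sentences:",
        *[f"  - {s}" for s in weakened],
    ])
    (out / "EVAL_ANCHOR_REPORT.md").write_text(body + "\n", encoding="utf-8")
    (out / "eval_anchors_checked.refined.ok").touch()  # optional marker
```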

More by WILLOSCAR

writer-context-pack
421

Build per-H3 writer context packs (NO PROSE): merge briefs + evidence packs + anchor facts + allowed citations into a single deterministic JSONL, so drafting is less hollow and less brittle. **Trigger**: writer context pack, context pack, drafting pack, paragraph plan pack, 写作上下文包. **Use when**: `outline/subsection_briefs.jsonl` + `outline/evidence_drafts.jsonl` + `outline/anchor_sheet.jsonl` exist and you want to make C5 drafting easier/more consistent. **Skip if**: upstream evidence is missing or scaffolded (fix `paper-notes` / `evidence-binder` / `evidence-draft` / `anchor-sheet` first). **Network**: none. **Guardrail**: NO PROSE; do not invent facts/citations; only use citation keys present in `citations/ref.bib`.

claim-evidence-matrix
421

Build a section-by-section claim–evidence matrix (`outline/claim_evidence_matrix.md`) from the outline and paper notes. **Trigger**: claim–evidence matrix, evidence mapping, 证据矩阵, 主张-证据对齐. **Use when**: before writing prose, you need to make each subsection's checkable claims and their evidence sources explicit (outline + paper notes are ready). **Skip if**: `outline/outline.yml` or `papers/paper_notes.jsonl` is missing. **Network**: none. **Guardrail**: bullets-only (NO PROSE); each claim needs at least 2 evidence sources (or an explicitly stated exception).

research-pipeline-runner
421

Run this repo’s Units+Checkpoints research pipelines end-to-end (survey/brief/paper-review/evidence-review/idea/tutorial/graduate-paper), with workspaces + checkpoints. **Trigger**: run pipeline, kickoff, 继续执行, 自动跑, 写一篇, survey/brief/review/调研/教程/系统综述/审稿. **Use when**: the user wants the flow run end-to-end (create `workspaces/<name>/`, generate and execute `UNITS.csv`, and stop to wait at HUMAN checkpoints). **Skip if**: the user explicitly wants to execute units one at a time by hand (use `unit-executor`), or you should not auto-advance to the prose stage. **Network**: depends on selected pipeline (arXiv/PDF/citation verification may need network; offline import supported where available). **Guardrail**: respect checkpoints (no prose without Approve); stop and wait at HUMAN units; never create workspace artifacts in the repo root.

subsection-polisher
421

Polish a single H3 unit file under `sections/` into survey-grade prose (de-template + contrast/eval/limitation), without changing citation keys. **Trigger**: subsection polisher, per-subsection polish, polish section file, 小节润色, 去模板, 结构化段落. **Use when**: `sections/S*.md` exists but reads rigid/template-y; you want to fix quality locally before `section-merger`. **Skip if**: subsection files are missing, evidence packs are incomplete, or `Approve C2` is not recorded. **Network**: none. **Guardrail**: do not invent facts/citations; do not add/remove citation keys; keep citations within the same H3; keep citations subsection-scoped.