task-handoff-summaries evaluations
The smallest headline delta in the collection (+12pp). Three of the four evals had a strong baseline (90% or higher): the formats are intuitive enough that an unprompted model gets close. The implementation-summary eval (+27pp) is the discriminator, mostly on the “every field is a commitment” rule.
Headline result
| Metric | With Skill | Baseline | Δ |
|---|---|---|---|
| Pass rate | 100% | 88% | +12pp |
Per-eval breakdown
| Eval | What it probes | With | Baseline | Δ |
|---|---|---|---|---|
| `single-agent-implementation-summary` | Implementation summary before commit: scope, decisions, test results, behavior-impact, what’s NOT in scope. | 100% | 73% | +27pp |
| `multi-agent-worker-handoff` | Worker handoff to orchestrator: branch + commit SHA, gate status, ready-for-acceptance signal, no self-merge. | 100% | 90% | +10pp |
| `closeout-summary-with-evidence` | Closeout: completed criteria, evidence per criterion, follow-ups, no glossing over incomplete items. | 100% | 91% | +9pp |
| `lightweight-no-task-closeout` | When there’s no formal task, the format degrades gracefully to change + why + verification, without inventing a fake task ID. | 100% | 100% | 0pp |
The implementation-summary eval is the load-bearing one. The hard rule the skill encodes is “every field is a commitment”: `Quality Gates: passed` without numbers is a smell. Baseline models cheerfully write `Tests: passing` and leave it at that; with the skill, they cite the actual count.
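The “every field is a commitment” rule can be approximated with a mechanical check. The sketch below is an illustrative heuristic, not the eval's actual grader: the function name `uncommitted_fields`, the list of status words, and the "value must contain a digit" test are all assumptions made for this example.

```python
import re

# Hypothetical list of words that assert an outcome without giving evidence.
STATUS_WORDS = re.compile(
    r"\b(pass(ed|ing)?|ok|green|done|success(ful)?)\b", re.IGNORECASE
)
# A "Field: value" line, optionally written as a bullet.
FIELD_LINE = re.compile(
    r"^\s*(?:[-*]\s*)?(?P<name>[A-Za-z][\w /-]*?):\s*(?P<value>.+)$"
)


def uncommitted_fields(summary: str) -> list[str]:
    """Return names of fields that claim a status but cite no number."""
    flagged = []
    for line in summary.splitlines():
        match = FIELD_LINE.match(line)
        if not match:
            continue
        value = match.group("value")
        # Assumed heuristic: a status claim counts as backed only if the
        # value contains at least one digit (a test count, an error count).
        if STATUS_WORDS.search(value) and not re.search(r"\d", value):
            flagged.append(match.group("name").strip())
    return flagged
```

Under these assumptions, a summary containing `Tests: passing` is flagged on the `Tests` field, while `Quality Gates: passed (42/42 tests, 0 lint errors)` is not. A digit check is deliberately crude; it only catches the bare-status case described above.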
The lightweight-no-task eval is non-discriminating in iteration-1: both arms score 100%. Sharpen it for iteration-2.
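As a small sketch of how non-discriminating evals could be spotted automatically, the snippet below computes the with-skill minus baseline delta for each eval and lists those that fall below a threshold. The pass rates are copied from the per-eval table; the function name `non_discriminating` and the `min_delta_pp` threshold are assumptions for illustration.

```python
# (with-skill, baseline) pass rates in percent, copied from the per-eval table.
RESULTS = {
    "single-agent-implementation-summary": (100, 73),
    "multi-agent-worker-handoff": (100, 90),
    "closeout-summary-with-evidence": (100, 91),
    "lightweight-no-task-closeout": (100, 100),
}


def non_discriminating(results, min_delta_pp=1):
    """Return evals whose delta (with-skill minus baseline) is below min_delta_pp."""
    return [
        name
        for name, (with_skill, baseline) in results.items()
        if with_skill - baseline < min_delta_pp
    ]
```

With the default threshold only `lightweight-no-task-closeout` is returned. Raising `min_delta_pp` makes the check stricter about what counts as a useful discriminator.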
Browse the full data
- Interactive review viewer: side-by-side outputs with per-assertion grading evidence.
- Eval definitions: prompts and per-assertion criteria.
- Workspace: raw `output.md`, `grading.json`, `timing.json` per run.