task-handoff-summaries evaluations

This skill has the smallest headline gain in the collection (+12pp). Three of the four evals had a strong baseline (90% or higher): the formats are intuitive enough that a model without the skill gets close. The implementation-summary eval (+27pp) is the discriminator, mostly on the “every field is a commitment” rule.

Headline result

Metric	With Skill	Baseline	Δ
Pass rate	100%	88%	+12pp

Per-eval breakdown

Eval	What it probes	With Skill	Baseline	Δ
`single-agent-implementation-summary`	Implementation summary before commit: scope, decisions, test results, behavior-impact, what’s NOT in scope.	100%	73%	+27pp
`multi-agent-worker-handoff`	Worker handoff to orchestrator: branch + commit SHA, gate status, ready-for-acceptance signal, no self-merge.	100%	90%	+10pp
`closeout-summary-with-evidence`	Closeout: completed criteria, evidence per criterion, follow-ups, no glossing over incomplete items.	100%	91%	+9pp
`lightweight-no-task-closeout`	When there’s no formal task, the format degrades gracefully to change + why + verification, without inventing a fake task ID.	100%	100%	0pp
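
As a quick check on the Δ column, the sketch below recomputes each difference in percentage points from the pass rates in the table above and lists the evals where the skill makes no measurable difference. The names `RATES`, `delta_pp`, and `non_discriminating` are illustrative assumptions and are not part of the eval harness.

```python
# Pass rates in percent, copied from the per-eval table: (with skill, baseline).
RATES = {
    "single-agent-implementation-summary": (100, 73),
    "multi-agent-worker-handoff": (100, 90),
    "closeout-summary-with-evidence": (100, 91),
    "lightweight-no-task-closeout": (100, 100),
}


def delta_pp(with_skill: int, baseline: int) -> int:
    """Difference in percentage points (not a relative percent change)."""
    return with_skill - baseline


def non_discriminating(rates: dict[str, tuple[int, int]]) -> list[str]:
    """Evals where the with-skill run and the baseline score the same."""
    return [name for name, (w, b) in rates.items() if delta_pp(w, b) == 0]


for name, (w, b) in RATES.items():
    print(f"{name}: {delta_pp(w, b):+d}pp")

print("non-discriminating:", non_discriminating(RATES))
```

An eval that appears in the non-discriminating list cannot distinguish the skill from the baseline, whatever its absolute pass rate.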

The implementation-summary eval is the load-bearing one. The hard rule the skill encodes is “every field is a commitment”. `Quality Gates: passed` without numbers is a smell. Without the skill, models cheerfully write `Tests: passing` and leave it at that; with the skill, they cite the actual count.
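
One way to picture the “every field is a commitment” rule is a small lint pass over a summary that flags any field claiming a status without citing a number. This is a minimal sketch under stated assumptions, not the grader the evals actually use: the `Field: value` line format, the `BARE_CLAIMS` word list, and the function name `uncommitted_fields` are all hypothetical, and the counts in the second sample are made up for illustration.

```python
import re

# Hypothetical list of status words that assert a result without evidence.
BARE_CLAIMS = re.compile(r"\b(pass(?:ed|ing|es)?|ok|green|done)\b", re.IGNORECASE)


def uncommitted_fields(summary: str) -> list[str]:
    """Return the names of fields whose value asserts a status but cites no number."""
    flagged = []
    for line in summary.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        if BARE_CLAIMS.search(value) and not re.search(r"\d", value):
            flagged.append(name.strip())
    return flagged


# Baseline-style summary: status words only.
baseline_style = "Tests: passing\nQuality Gates: passed"

# Skill-style summary: every claim carries a count (numbers are illustrative).
skill_style = "Tests: 42 passed, 0 failed\nQuality Gates: lint 0 errors, typecheck 0 errors"

print(uncommitted_fields(baseline_style))
print(uncommitted_fields(skill_style))
```

A digit check is a crude proxy for evidence; a real assertion would likely also verify that the cited count matches the actual test run.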

The lightweight-no-task eval is non-discriminating in iteration-1: the baseline already scores 100%, so the skill has no room to show a gain. It should be sharpened for iteration-2.

Browse the full data

Interactive review viewer: side-by-side outputs with per-assertion grading evidence.
Eval definitions: prompts and per-assertion criteria.
Workspace: raw `output.md`, `grading.json`, and `timing.json` per run.

Back to skill

task-handoff-summaries skill page