cross-agent-review evaluations

Smaller headline (+16pp) because three of four evals had a strong baseline. The redaction-discipline eval (+50pp) is the load-bearing result: without the skill, handoff packages routinely include “all tests pass” or the implementer’s reasoning trace, anchoring the reviewer toward rubber-stamping.

Headline result

Metric	With Skill	Baseline	Δ
Pass rate	100%	84%	+16pp

Per-eval breakdown

Eval	What it probes	With	Baseline	Δ
`build-handoff-package-with-redaction`	Package includes diff + design + acceptance criteria. REDACTS self-assessment, framing, reasoning trace, gate-passed claims.	100%	50%	+50pp
`perform-cold-review-no-anchoring`	Reviewer treats the work cold; catches load-bearing issues (timing attack, signing-key misuse, missing malformed/expired handling).	100%	88%	+12pp
`write-disagreement-summary`	Even-handed disagreement summary in the prescribed 6-section format with named human escalation.	100%	100%	0pp
`skip-cross-agent-for-trivial-work`	Correctly skips for typos, applies for high-value triggers.	100%	100%	0pp

The two non-discriminating evals are flagged for redesign in iteration-2. The current prompts contain enough policy framing that the baseline gets there too.

Going deeper: the framing-anchors-review-outcome experiment

The four evals above test whether the skill produces correct artifacts. A separate, larger experiment tests whether the redaction intervention changes downstream review outcomes: 6 diffs (5 with planted bugs + 1 control) × 4 conditions (redacted + 3 framing intensities) × 2 reviewer providers (Claude, Codex) × 2 runs = 96 cells. Real signals on severity downgrading and false-positive tolerance under framing.

Read the framing-anchors-review-outcome writeup for results.

Browse the full data

Interactive review viewer: side-by-side outputs with per-assertion grading evidence.
Eval definitions: prompts and per-assertion criteria.
Workspace: raw output.md, grading.json, timing.json per run.

Back to skill

cross-agent-review skill page