cross-agent-review evaluations
Smaller headline (+16pp) because three of four evals had a strong baseline. The redaction-discipline eval (+50pp) is the load-bearing result: without the skill, handoff packages routinely include “all tests pass” or the implementer’s reasoning trace, anchoring the reviewer toward rubber-stamping.
Headline result
| Metric | With Skill | Baseline | Δ |
|---|---|---|---|
| Pass rate | 100% | 84% | +16pp |
Per-eval breakdown
| Eval | What it probes | With | Baseline | Δ |
|---|---|---|---|---|
build-handoff-package-with-redaction | Package includes diff + design + acceptance criteria. REDACTS self-assessment, framing, reasoning trace, gate-passed claims. | 100% | 50% | +50pp |
perform-cold-review-no-anchoring | Reviewer treats the work cold; catches load-bearing issues (timing attack, signing-key misuse, missing malformed/expired handling). | 100% | 88% | +12pp |
write-disagreement-summary | Even-handed disagreement summary in the prescribed 6-section format with named human escalation. | 100% | 100% | 0pp |
skip-cross-agent-for-trivial-work | Correctly skips for typos, applies for high-value triggers. | 100% | 100% | 0pp |
The two non-discriminating evals are flagged for redesign in iteration-2. The current prompts contain enough policy framing that the baseline gets there too.
Going deeper: the framing-anchors-review-outcome experiment
The four evals above test whether the skill produces correct artifacts. A separate, larger experiment tests whether the redaction intervention changes downstream review outcomes: 6 diffs (5 with planted bugs + 1 control) × 4 conditions (redacted + 3 framing intensities) × 2 reviewer providers (Claude, Codex) × 2 runs = 96 cells. Real signals on severity downgrading and false-positive tolerance under framing.
Read the framing-anchors-review-outcome writeup for results.
Browse the full data
- Interactive review viewer: side-by-side outputs with per-assertion grading evidence.
- Eval definitions: prompts and per-assertion criteria.
- Workspace: raw
output.md,grading.json,timing.jsonper run.