structured-code-review

A review-only output format for code reviews. Domain skills bring opinions about what to flag; this skill brings the discipline of how to present findings so the author can triage at a glance and the reviewer can be held accountable to a named source of truth.

What makes this skill distinct

Reviews are review-only by default. The reviewer’s job is to surface, name severity, cite source-of-truth, and propose alignment, not to apply fixes. Don’t edit files unless the user explicitly asks.
Findings: header + 8-field preamble. Every review starts the same way. Findings: on its own line, then Review Scope:, Process Context:, Execution Context:, Integration Target:, Design Reference:, Architecture Reference:, Feature Specification Reference:, PRD Reference:, in that order, even on a one-line change.
Severity tags in backticks, ordered highest first. `High`, `Medium`, `Low`. The author scans severities down the left margin and triages in seconds. Domain skills MAY add `Critical` for production-blocker class issues (see Severity ladder).
No-findings format is identical to findings format. Even on a clean change, the preamble emits and Residual Risks / Gaps: is mandatory. That’s what proves the review actually happened versus an unaccountable “looks good.”
Per-finding Source of truth: discipline. Every finding names what it’s checking against: task design, architecture spec, feature specification, PRD, lightweight task context, or general code quality. Reviews that don’t name a source of truth aren’t reviews; they’re opinions.
Don’t cite the skill in the output. The skill is a reference for the reviewer; the audience for the output is the PR author. Reviews argue from first principles, not from the skill’s rulebook.

What it covers

Required preparation before writing findings
Eight-field preamble with None applicable / Unable to determine / No task identified as valid values
Per-finding format: severity tag in backticks, file:line citation, Problem/Why/Source of truth/Proposed fix structure
Three-tier severity ladder (High / Medium / Low), with optional Critical for domain skills
No-findings case (preamble still emitted, plus Residual Risks / Gaps:)
Hard rules against summary-before-findings, fabricated tasks, silent source-of-truth skips, applying fixes when only review was requested

Quick install

Inside Claude Code:

/plugin marketplace add ribrewguy/agent-skills
/plugin install structured-code-review@ribrewguy-skills

For other tools, see Install.

Composes with

rest-api-design: domain review opinions for HTTP REST APIs. Pair with this skill. rest-api-design identifies the violations; structured-code-review structures the output.
(planned) multi-agent-git-workflow: branch and role vocabulary used in Execution Context: and Integration Target: preamble fields.
(planned) task-handoff-summaries: the report format that sits upstream of code review (the implementation summary the author hands you before you review).

Tooling and dependencies

Required: none. The format is tool-neutral.
Strongly recommended: a task / issue tracker (Beads, GitHub Issues, Linear, Jira, Shortcut). Without one, every review’s Review Scope: is No task identified and the skill falls back to lightweight-process review.
Strongly recommended: source-of-truth artifacts (design docs, architecture specs, feature specifications, PRDs). Reviews are stronger when they can name what they’re checking against.
Adapter: Beads. The skill body has a dedicated section showing the literal mapping (Review Scope: Bead proj-42, bd show <id> for source-of-truth lookups). Other trackers substitute their own ID format.

Source of truth

Full SKILL.md on GitHub: the canonical reference loaded by AI tools.
Eval set on GitHub: the four test cases used to verify the skill’s behavior.

Eval results

Full per-eval breakdown, interactive review viewer, and links to raw model outputs: structured-code-review evaluations.

Iteration-1 benchmark: 100% pass rate with-skill vs. 31% baseline across four test cases (+69pp delta, the largest skill effect we’ve measured in this collection):

Eval	What it probes
`mixed-severity-pr-review`	Does the skill produce the eight-field preamble, sort findings by severity, identify a realistic mix of issues across security and convention?
`no-findings-clean-change`	Does the skill emit the preamble + `No findings.` + `Residual Risks / Gaps:` even when there are zero findings?
`no-task-no-design-fallback`	Does the skill correctly emit `Review Scope: No task identified` and review against lightweight-process context without fabricating a task?
`rest-domain-composition`	Does the skill compose with rest-api-design, applying the format to REST domain findings?

Eval transcripts and benchmark JSON live alongside the skill source.

Invocation examples

“Review this PR using structured-code-review’s format.”
“Audit this implementation against the design at docs/specs/auth-flow.md. Tag findings with severity.”
“Code review the changes on this branch. What doesn’t match the PRD?”
“Is this PR ready to merge?”
“Use rest-api-design + structured-code-review to review the new endpoint.”