branch-promotion-discipline evaluations

The hotfix-with-forward-merge eval (+62pp) is the load-bearing one. Untrained models routinely propose hotfixing from main and skipping the post-hotfix forward-merge to develop, which produces the most common silent-bug-reintroduction failure mode of 3-tier flows.

The other three evals are non-discriminating in iteration-1: the prompts contain enough policy framing (e.g., “the repo’s flow is develop / uat / main with one-way promotion”) that even baseline models match the assertions. Iteration-2 will sharpen those prompts.

Headline result

Metric	With Skill	Baseline	Δ
Pass rate	100%	84%	+15pp

Per-eval breakdown

Eval	What it probes	With	Baseline	Δ
`set-up-3-tier-flow-from-single-main`	Designing the flow for a new repo: long-lived branches, promotion direction, CI gate matrix, branch protection, adoption sequence.	100%	100%	0pp
`hotfix-flow-with-forward-merge`	Production hotfix: branch from `uat` not `main`, PR sequence, regression-only exception, post-hotfix forward-merge to `develop`.	100%	38%	+62pp
`reject-pr-skipping-uat`	A PR is open from develop direct to main with “urgent” framing. Reject and propose the correct path.	100%	100%	0pp
`design-ci-gate-matrix`	Per-tier CI gates with rationale. Each tier strictly stricter than the previous.	100%	100%	0pp

The hotfix eval is the discriminator. The forward-merge step is a non-obvious “off the happy path” detail that untrained models miss systematically. Without the forward-merge, develop reintroduces the bug at the next normal promotion cycle. The skill encodes this as a hard rule because the failure mode is silent and recurring.

A note on the non-discriminating evals

Three of four prompts contain enough policy framing in the prompt itself that the baseline matches the assertions. For example, the reject-pr-skipping-uat prompt opens with “the repo’s flow is develop / uat / main with one-way promotion”, which is itself most of the answer. A baseline model parrots that framing back without needing the skill.

This is a design issue with the eval prompts, not the skill. Iteration-2 should:

Drop the policy-statement framing from those prompts. Phrase them as “we have these branches, what should we do?” without explaining what one-way promotion means.
Add a more diagnostic eval, e.g., a scenario where the team has been running 3-tier for a year but a hotfix-without-forward-merge is causing repeated reintroductions and the question is “what’s going wrong?”. The skill’s forward-merge rule should be the load-bearing answer.

Browse the full data

Interactive review viewer: side-by-side outputs with per-assertion grading evidence.
Eval definitions: prompts and per-assertion criteria.
Workspace: raw output.md, grading.json, timing.json per run.

Back to skill

branch-promotion-discipline skill page