The Golden Set Architect: Build the Eval That Decides What Ships

Design a frontier-grade golden eval set from scratch — stratified, leak-proof, and tied to real failure modes — so 'is it good enough to ship?' becomes a number, not a vibe.

ROLE: You are a Head of Evaluation who has stood up the offline eval program at a frontier AI lab. You build golden sets that catch regressions BEFORE they reach users. Generic 'write some test cases' advice is a failure condition; every choice must be defended.

PRODUCT & TASK:
[PRODUCT & TASK — what the model/feature does, for whom]

KEY USE CASES: [KEY USE CASES — top 3-6 user journeys]
KNOWN FAILURE MODES: [FAILURE MODES — where it breaks today, or 'unknown']

STEP 0 — GATE: Ask me EXACTLY 3 questions whose answers would most change the eval design (e.g. cost of a false pass, traffic distribution, who labels). Wait for my answers before continuing.

THEN deliver:
1. EVAL CHARTER — the one decision this set must inform and the pass bar that means 'ship'.
2. STRATIFICATION MAP — the dimensions to sample across (use case, difficulty, input length, language, adversarial vs benign) with target counts per cell and WHY each cell exists.
3. GOLDEN SET SPEC — how many items, sourcing (real logs vs synthetic vs hand-authored), and the labeling protocol (who, rubric, adjudication of disagreements).
4. LEAKAGE & FRESHNESS CONTROLS — how you keep this set out of training data and rotate a held-out slice.
5. METRICS — primary metric + 2-3 guardrail metrics, each with definition and how it's computed.
6. STARTER ITEMS — 8 example eval items spanning easy→adversarial, each with input, expected behavior, and what a failure looks like.
7. SCALE PLAN — how to grow from v0 to a trustworthy set, and the cadence to refresh it.

CONSTRAINTS: Quantify counts and bars. Flag any assumption as [ASSUMPTION]. No buzzwords. This is process guidance, not a guarantee of model safety or correctness.

OUTPUT FORMAT: Sections 1-7 with tables where useful, then a 5-bullet BUILD ORDER (what to create in week 1) and the single first eval item to write today.

Fill in the bracketed [VARIABLES] with your details, then paste the prompt into your AI assistant.
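
To sanity-check the model's answer, it can help to have a rough picture of what a stratification map (deliverable 2) and a rotating held-out slice (deliverable 4) look like in practice. The Python sketch below is illustrative only and is not part of the prompt: the function names, the example dimensions, the traffic shares, and the 20% held-out fraction are all assumptions chosen for the example. It gives every cell a minimum count, spreads the remaining budget by use-case traffic share, and assigns items to the held-out slice by hashing their IDs with a rotation label.

```python
import hashlib
from itertools import product


def stratification_map(dimensions, use_case_share, total_items, min_per_cell):
    """Return {cell: target_count}; a cell is one value per dimension.

    Hypothetical helper: every cell gets `min_per_cell`, and the remaining
    budget is split by use-case traffic share, evenly within a use case.
    Counts are rounded down, so the sum can fall slightly below total_items.
    """
    names = list(dimensions)
    cells = list(product(*(dimensions[name] for name in names)))
    floor_total = min_per_cell * len(cells)
    if floor_total > total_items:
        raise ValueError("total_items is too small for the per-cell floor")
    remainder = total_items - floor_total
    use_case_idx = names.index("use_case")
    cells_per_use_case = len(cells) // len(dimensions["use_case"])
    counts = {}
    for cell in cells:
        share = use_case_share[cell[use_case_idx]]
        counts[cell] = min_per_cell + int(remainder * share / cells_per_use_case)
    return counts


def is_held_out(item_id, held_out_fraction=0.2, rotation="v0"):
    """Deterministically assign an item to the held-out slice.

    Changing `rotation` (an assumed label, e.g. one per refresh cycle)
    reshuffles which items are held out without storing any extra state.
    """
    digest = hashlib.sha256(f"{rotation}:{item_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < held_out_fraction


# Example inputs (all values are made up for illustration).
example_dimensions = {
    "use_case": ["summarize", "extract"],
    "difficulty": ["easy", "hard"],
    "intent": ["benign", "adversarial"],
}
example_share = {"summarize": 0.75, "extract": 0.25}
example_counts = stratification_map(
    example_dimensions, example_share, total_items=200, min_per_cell=10
)
# With these inputs each "summarize" cell targets 32 items
# and each "extract" cell targets 17.
```

The per-cell floor is a deliberate choice: low-traffic cells such as adversarial inputs still get enough items to surface failures that proportional sampling would starve. Note that the hash only decides which items sit in the held-out slice; keeping the set out of training data still needs separate access controls.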
