The Golden Set Architect: Build the Eval That Decides What Ships
Design a frontier-grade golden eval set from scratch — stratified, leak-proof, and tied to real failure modes — so 'is it good enough to ship?' becomes a number, not a vibe.
ROLE: You are a Head of Evaluation who has stood up the offline eval program at a frontier AI lab. You build golden sets that catch regressions BEFORE they reach users. Generic 'write some test cases' advice is a failure condition; every choice must be defended.

PRODUCT & TASK: [PRODUCT & TASK: what the model/feature does, for whom]
KEY USE CASES: [KEY USE CASES: top 3-6 user journeys]
KNOWN FAILURE MODES: [FAILURE MODES: where it breaks today, or 'unknown']

STEP 0 (GATE): Ask me EXACTLY 3 questions whose answers most change the eval design (e.g. cost of a false pass, traffic distribution, who labels). Wait for answers.

THEN deliver:
1. EVAL CHARTER: the one decision this set must inform and the pass bar that means 'ship'.
2. STRATIFICATION MAP: the dimensions to sample across (use case, difficulty, input length, language, adversarial vs benign) with target counts per cell and WHY each cell exists.
3. GOLDEN SET SPEC: how many items, sourcing (real logs vs synthetic vs hand-authored), and the labeling protocol (who, rubric, adjudication of disagreements).
4. LEAKAGE & FRESHNESS CONTROLS: how you keep this set out of training data and rotate a held-out slice.
5. METRICS: primary metric + 2-3 guardrail metrics, each with definition and how it's computed.
6. STARTER ITEMS: 8 example eval items spanning easy→adversarial, each with input, expected behavior, and what a failure looks like.
7. SCALE PLAN: how to grow from v0 to a trustworthy set, and the cadence to refresh it.

CONSTRAINTS: Quantify counts and bars. Flag any assumption as [ASSUMPTION]. No buzzwords. This is process guidance, not a guarantee of model safety or correctness.

OUTPUT FORMAT: Sections 1-7 with tables where useful, then a 5-bullet BUILD ORDER (what to create in week 1) and the single first eval item to write today.
Fill in the bracketed [VARIABLES] with your details, then paste the prompt into your AI assistant.
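To make the deliverables in the prompt concrete, here is a minimal Python sketch of three pieces it asks for: checking a golden set against a stratification map with target counts per cell, deterministically rotating a held-out slice, and turning a primary pass bar plus guardrail metrics into a ship decision. Everything named here (`TARGETS`, `coverage_gaps`, `is_held_out`, `ship_decision`, the cell labels, and the item fields `use_case`, `difficulty`, `passed`) is hypothetical and chosen for illustration; the prompt itself does not prescribe any implementation.

```python
import hashlib
from collections import Counter

# Hypothetical stratification map: (use_case, difficulty) -> target item count.
TARGETS = {
    ("billing", "easy"): 2,
    ("billing", "adversarial"): 1,
    ("search", "easy"): 1,
}


def coverage_gaps(items, targets):
    """Return each under-filled cell with the number of items still missing."""
    counts = Counter((it["use_case"], it["difficulty"]) for it in items)
    return {
        cell: need - counts[cell]
        for cell, need in targets.items()
        if counts[cell] < need
    }


def is_held_out(item_id, rotation, fraction=0.2):
    """Assign an item to the held-out slice for a given rotation number.

    Hashing (rotation, item_id) keeps the assignment stable within a
    rotation and reshuffles it when the rotation number changes.
    """
    digest = hashlib.sha256(f"{rotation}:{item_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 < fraction


def ship_decision(results, pass_bar, guardrails):
    """Combine a primary pass rate with guardrail ceilings.

    results: list of dicts with a boolean 'passed' plus boolean guardrail flags.
    guardrails: guardrail name -> maximum allowed rate of flagged items.
    """
    n = len(results)
    if n == 0:
        raise ValueError("cannot decide on an empty result set")
    pass_rate = sum(r["passed"] for r in results) / n
    breaches = {}
    for name, max_rate in guardrails.items():
        rate = sum(bool(r.get(name, False)) for r in results) / n
        if rate > max_rate:
            breaches[name] = rate
    return {
        "pass_rate": pass_rate,
        "breaches": breaches,
        "ship": pass_rate >= pass_bar and not breaches,
    }
```

A typical use would be to run `coverage_gaps` whenever items are added, so an empty cell is visible before any score is reported, and to call `ship_decision` once per candidate model with the pass bar fixed in the eval charter. The hash-based slice is only one way to rotate held-out items; it does not by itself keep the set out of training data.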
More Unicorn Builder prompts
- Rubric Forge: Turn a Fuzzy 'Quality Bar' Into a Defensible Scorecard
- LLM-as-Judge Validator: Prove Your Judge Before You Trust It
- Regression Gate Designer: The CI Check That Blocks a Worse Model
- Red Team Playbook: Break Your Model Like an Adversary Would
- Hallucination Eval Builder: Measure Faithfulness So You Can Reduce It
- Safety & Refusal Calibration: Find the Line Between Safe and Useless