Build a Golden Test Set From Production Failures
Mines your real failure logs into a curated, labeled golden test set — deduplicated, stratified, and frozen — so every prompt or model change is judged against the cases that actually broke, not a tidy demo set.
You are a data curation engineer who treats a test set as a frozen contract, not a folder of nice examples, and you know a golden set's value is destroyed the moment it's contaminated, unbalanced, or mislabeled. The system and its real failures: <failures> [FAILURES] </failures> How a correct output is defined and who's qualified to label it, plus any labeling guidelines you already have (optional): [LABELING] The distribution of real traffic you want the set to represent — common cases vs. rare-but-costly (optional): [DISTRIBUTION] Think first, silently: cluster the failures by root cause rather than surface symptom; decide which clusters are true model errors versus bad inputs or wrong expectations; and plan a stratified sample that mirrors real traffic while over-weighting rare high-stakes cases on purpose. Then specify the golden set build. CONSTRAINTS: Deduplicate near-identical cases so one bug doesn't dominate the score, and record the original frequency so weighting stays honest. Stratify across the real input distribution and document the intended mix. Write a labeling protocol precise enough that two annotators agree — define the answer key, the allowed tolerance, and the tie-break rule; require a second-pass review on a sample and report expected agreement. Freeze the set with a version and a changelog; specify how new cases get added without silently shifting the baseline. Separate this evaluation set strictly from any training or few-shot data and state how you prevent leakage. Flag cases where the 'correct' answer is contested and how you handle them. Do not fabricate examples or labels — work only from what's provided and mark gaps to fill. OUTPUT FORMAT: (1) 'Failure clusters' — root-cause groups with counts and which are true model errors. (2) 'Sampling plan' — strata, target mix, dedup rule, and rare-case over-weighting. (3) 'Labeling protocol' — answer-key definition, tolerance, tie-breaks, second-pass review, expected agreement. (4) 'Set composition' — final case counts per slice and version tag. (5) 'Contamination guard' — how train/eval separation and leakage prevention are enforced. (6) 'Maintenance rule' — how cases are added or retired without breaking comparability. (7) 'Open labels' — contested cases needing a human ruling.
Fill the highlighted [VARIABLES] with your details, then paste into your AI.
Get the full vault — 2,400+ premium AI prompts
Free to start. Copy, customize, and run in ChatGPT, Claude & Gemini in seconds.
Start free at getproprompt.com →More Tech & Engineering prompts
- Design an Eval Set That Catches Regressions Before Users Do
- Write a Model Card That Survives Legal, Risk, and Reuse
- Spec a RAG Pipeline That Doesn't Hallucinate Citations
- Turn a Prompt Into Versioned, Testable Prompt-as-Code
- Design Guardrails for an LLM Feature Going to Real Users
- Write the Runbook for an LLM Service On-Call Page