Build a Golden Test Set From Production Failures

Mines your real failure logs into a curated, labeled golden test set — deduplicated, stratified, and frozen — so every prompt or model change is judged against the cases that actually broke, not a tidy demo set.

You are a data curation engineer who treats a test set as a frozen contract, not a folder of nice examples, and you know a golden set's value is destroyed the moment it's contaminated, unbalanced, or mislabeled. The system and its real failures:
<failures>
[FAILURES]
</failures>
How a correct output is defined and who's qualified to label it, plus any labeling guidelines you already have (optional): [LABELING]
The distribution of real traffic you want the set to represent, such as the split between common cases and rare-but-costly ones (optional): [DISTRIBUTION]
Think first, silently: cluster the failures by root cause rather than surface symptom; decide which clusters are true model errors versus bad inputs or wrong expectations; and plan a stratified sample that mirrors real traffic while over-weighting rare high-stakes cases on purpose.
Then specify the golden set build. CONSTRAINTS: Deduplicate near-identical cases so one bug doesn't dominate the score, and record each case's original frequency so weighting stays honest. Stratify across the real input distribution and document the intended mix. Write a labeling protocol precise enough that two independent annotators reach the same label: define the answer key, the allowed tolerance, and the tie-break rule; require a second-pass review on a sample and report the expected inter-annotator agreement. Freeze the set with a version and a changelog; specify how new cases get added without silently shifting the baseline. Separate this evaluation set strictly from any training or few-shot data and state how you prevent leakage. Flag cases where the 'correct' answer is contested and state how you handle them. Do not fabricate examples or labels; work only from what's provided and mark gaps to fill.
OUTPUT FORMAT: (1) 'Failure clusters' — root-cause groups with counts and which are true model errors. (2) 'Sampling plan' — strata, target mix, dedup rule, and rare-case over-weighting. (3) 'Labeling protocol' — answer-key definition, tolerance, tie-breaks, second-pass review, expected agreement. (4) 'Set composition' — final case counts per slice and version tag. (5) 'Contamination guard' — how train/eval separation and leakage prevention are enforced. (6) 'Maintenance rule' — how cases are added or retired without breaking comparability. (7) 'Open labels' — contested cases needing a human ruling.

Fill the highlighted [VARIABLES] with your details, then paste into your AI.
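
If you want to pre-process your failure log before filling in [FAILURES], the dedup, stratification, freeze, and leakage constraints in the prompt can be sketched roughly as below. This is only an illustration under stated assumptions: the record fields (`input`, `stratum`, `original_frequency`) and all function names are hypothetical, and exact matching after whitespace and case normalization stands in for whatever near-duplicate rule you actually adopt.

```python
import hashlib
import json
import re
from collections import Counter, defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(failures: list[dict]) -> list[dict]:
    """Keep the first case per normalized input and record how often it occurred."""
    counts = Counter(normalize(f["input"]) for f in failures)
    seen: set[str] = set()
    unique = []
    for f in failures:
        key = normalize(f["input"])
        if key not in seen:
            seen.add(key)
            # Assumed field name: original_frequency preserves the raw count.
            unique.append({**f, "original_frequency": counts[key]})
    return unique


def stratified_pick(cases: list[dict], quota: dict[str, int]) -> list[dict]:
    """Take up to quota[stratum] cases per stratum, most frequent first.

    Over-weighting a rare stratum is done simply by giving it a larger
    quota than its share of real traffic would justify.
    """
    by_stratum = defaultdict(list)
    for c in cases:
        by_stratum[c["stratum"]].append(c)
    picked = []
    for stratum, n in quota.items():
        ranked = sorted(by_stratum[stratum], key=lambda c: -c["original_frequency"])
        picked.extend(ranked[:n])
    return picked


def version_tag(cases: list[dict]) -> str:
    """Content hash of the frozen set; any edit to any case changes the tag."""
    blob = json.dumps(cases, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def leaked(cases: list[dict], train_inputs: list[str]) -> list[dict]:
    """Return golden cases whose normalized input also appears in training or few-shot data."""
    train = {normalize(t) for t in train_inputs}
    return [c for c in cases if normalize(c["input"]) in train]
```

A typical flow would be `dedupe` on the raw log, `stratified_pick` with the quotas from your sampling plan, a `leaked` check against training and few-shot inputs, and then `version_tag` on the final list so the changelog can reference an exact, reproducible snapshot.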
