TL;DR: I ran a bake-off across 6 LLMs on a single task (Knowledge-Audit cross-source contradiction). The winner was unexpected. Haiku — my default for lightweight tasks — finished last (13% accuracy as a direct auditor). Grok 4.3 beat Opus 4.7. Methodology + insights below.
Methodology — 5 components
1. Stratified eval set — not just 30 random cases.
192 cases bootstrapped via Haiku, split as:
- Bucket × Lang stratified (5 × 3 = 15 strata)
- Dev (93) for prompt iteration
- Holdout (99) FROZEN — only run once at final decision time
→ Holdout = anti-overfit. Bake-off on dev → confirm on holdout.
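The split in component 1 can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `bucket` and `lang` field names, the 50/50 fraction, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(cases, holdout_frac=0.5, seed=0):
    """Split cases into (dev, holdout), stratifying on (bucket, lang).

    Each case is a dict with hypothetical "bucket" and "lang" keys.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    strata = defaultdict(list)
    for case in cases:
        strata[(case["bucket"], case["lang"])].append(case)

    dev, holdout = [], []
    for key in sorted(strata):  # sorted for a deterministic stratum order
        group = strata[key][:]
        rng.shuffle(group)
        cut = round(len(group) * holdout_frac)
        holdout.extend(group[:cut])
        dev.extend(group[cut:])
    return dev, holdout
```

To keep the holdout frozen in practice, one option is to write it to disk once and never re-run the split.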
2. Single task, multiple judges
Same SYSTEM_PROMPT_V3 (extract → compare → flag if different). Same user message format. Apples-to-apples.
Judge list:
- Claude Haiku 4.5
- Claude Sonnet 4.6
- Claude Opus 4.7
- xAI Grok 4-fast-reasoning
- xAI Grok 4.3
- Google Gemini Flash latest (3.5)
- OpenAI GPT-5.4-mini
3. LLM-judge scorer (replaces strict substring match)
Each finding → Sonnet 4.6 verdict: VALID (real contradiction) or INVALID (paraphrase, wrong quote, over-inference).
Pass rule for positive cases: ≥1 finding VALID.
Pass rule for negative cases: output = [].
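The two pass rules above reduce to a few lines. A minimal sketch, assuming each case carries a positive/negative label and a list of per-finding verdict strings (both names are illustrative):

```python
def case_passes(is_positive, verdicts):
    """Apply the pass rule to one case.

    verdicts: one "VALID" / "INVALID" string per finding the auditor returned.
    """
    if is_positive:
        # Positive case: at least one finding must be judged VALID.
        return "VALID" in verdicts
    # Negative case: the auditor must have returned no findings at all.
    return len(verdicts) == 0


def accuracy(cases):
    """cases: list of (is_positive, verdicts) pairs."""
    results = [case_passes(is_positive, verdicts) for is_positive, verdicts in cases]
    return sum(results) / len(results)
```

Note that a negative case fails as soon as the auditor emits any finding, regardless of how that finding would be judged.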
4. Bootstrap CI — 1000 resamples, 95% confidence interval.
5. Pairwise Cohen’s κ — judge agreement (binary classifier outcome). Detects correlated bias.
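The bootstrap CI from component 4 can be sketched as below. The 1000 resamples and 95% level come from the setup above; the percentile method and the seed are assumptions, since the exact bootstrap variant is not specified.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for accuracy.

    outcomes: list of 0/1 pass flags, one per eval case.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample cases with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lo, hi
```

With only 30 cases the interval will be wide, which is worth keeping in mind when comparing judges that sit a few points apart.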
Insight 1 — Haiku FAILS the audit task (13% accuracy)
I had hypothesized Haiku = good for “lightweight”. Reality: Haiku as a direct auditor severely under-detects.
| Judge | Acc N=30 dev | Acc N=99 holdout |
|---|---|---|
| Haiku 4.5 | 13.3% | (not run, abandoned) |
| Sonnet 4.6 | 60% | — |
| Grok 4.3 | 83.3% | 79.8% (strict) / 99.0% (LLM-judge) |
Likely why: Haiku over-applies the “be conservative” rule in the system prompt and interprets it as “default to []”, so it misses real findings.
→ Haiku ≠ universal cheap option. Task-dependent. OK as a verifier, fails as a direct auditor.
Insight 2 — Grok 4.3 beats Opus 4.7
| Judge | Accuracy | Provider |
|---|---|---|
| Grok 4.3 | 83% | xAI |
| Opus 4.7 | 67% | Anthropic flagship |
| Sonnet 4.6 | 60% | — |
Opus = Anthropic’s most capable model. Grok 4.3 = xAI mid-tier. Grok beat Opus by +16pp on the audit task.
Hypothesis why: Opus is trained for broad reasoning and may over-think audit tasks (subtle vs explicit contradictions). Grok 4.3’s reasoning training is more “task-grounded”.
→ “Most capable” ≠ “best for your task”. Bake-off mandatory.
Insight 3 — Gemini fails entirely (99/99 cases returned no output)
Gemini Flash latest = newest model (3.5 per the modelVersion field). Returned an empty candidates array on 99/99 cases. I hypothesize either a safety filter or an output format incompatibility.
→ Newest model ≠ usable. Sanity-test the API integration before the bake-off.
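A sanity test of this kind can be a handful of calls before the full run. The sketch below is provider-agnostic: `call_model` is a hypothetical callable that wraps whatever client is in use and returns the raw text, and the zero-tolerance threshold is an assumption.

```python
def smoke_test(call_model, prompts, max_empty_frac=0.0):
    """Run a few prompts through a judge and count unusable responses.

    call_model: hypothetical callable, prompt -> raw response text.
    Returns (ok, empty_count).
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    empty = 0
    for prompt in prompts:
        try:
            text = call_model(prompt)
        except Exception:
            text = None  # treat API errors the same as an empty response
        if not text or not text.strip():
            empty += 1
    ok = empty / len(prompts) <= max_empty_frac
    return ok, empty
```

A literal `[]` counts as a usable response here, since that is the expected output on negative cases; only missing or blank text is flagged.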
Insight 4 — Cohen’s kappa reveals correlated bias
| Pair | κ | Note |
|---|---|---|
| Sonnet ↔ Opus | 0.66 | Same Anthropic family — correlated bias |
| Grok ↔ Sonnet | 0.52 | Independent providers |
| Grok ↔ OpenAI | 0.55 | Independent |
| Haiku ↔ Grok | −0.40 | Anti-correlated — opposite errors |
→ For ensembles: avoid same-family pairs. Pick across providers.
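For reference, pairwise κ on binary outcomes can be computed as in this minimal sketch. It assumes each judge's results are a list of 0/1 flags aligned on the same cases; returning NaN for the degenerate case is a convention chosen here, not something taken from the original setup.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two aligned lists of binary (0/1) outcomes."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of cases where both judges match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each judge's marginal rate of 1s.
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_e == 1:
        return float("nan")  # both judges constant: kappa is undefined
    return (p_o - p_e) / (1 - p_e)
```

κ near 1 means the two judges pass and fail the same cases, κ near 0 means agreement is no better than chance, and negative κ means they tend to disagree.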
Insight 5 — VN/EN/mixed bucket reveals language bias
Grok 4.3 holdout-99:
- VN: 7/8 = 88%
- EN: 5/8 = 63%
- Mixed: 10/14 = 71%
→ Grok is unexpectedly strong on VN and weak on EN for this task. Counter-intuitive, though with only 8 to 14 cases per language the gap is suggestive, not conclusive.
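A per-language (or per-bucket) breakdown like the one above is a simple group-by over case results. A minimal sketch, assuming each result row is a dict with a `passed` flag plus whatever grouping field is needed; the row layout is illustrative, and the rows below are synthesized only to reproduce the counts listed above.

```python
from collections import defaultdict


def accuracy_by(results, key):
    """Group result rows by `key` and return {group: (passed, total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for row in results:
        counts[row[key]][0] += 1 if row["passed"] else 0
        counts[row[key]][1] += 1
    return {group: (p, n, p / n) for group, (p, n) in counts.items()}


def make_rows(lang, passed, total):
    """Build synthetic rows with `passed` passes out of `total` cases."""
    return [{"lang": lang, "passed": i < passed} for i in range(total)]


rows = make_rows("VN", 7, 8) + make_rows("EN", 5, 8) + make_rows("Mixed", 10, 14)
for lang, (p, n, acc) in accuracy_by(rows, "lang").items():
    print(f"{lang}: {p}/{n} = {acc:.1%}")
```

The same function works for the bucket dimension by passing a different `key`.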
Insight 6 — Code-mixed bucket = hardest
The real corpus has 28% of sentences mixing VN diacritics + EN tech terms (e.g. “em verify chưa được vì RAM 64GB không đủ load 235B MoE”). The “code_mixed” bucket = lowest accuracy across ALL judges.
→ Eval sets MUST include code-mixed cases. Pure VN/EN cases overestimate accuracy.
Insight 7 — Holdout reveals overfit
Grok 4.3 dev N=30 = 83.3%. Holdout N=99 = 79.8% (strict) / 99.0% (LLM-judge scoring). Strict scoring on holdout dropped 3.5pp vs dev → consistent with mild overfit (at N=30 a gap this size is also within sampling noise).
→ Single-set eval is misleading. Always have a frozen holdout.
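One way to gauge how much of a dev-to-holdout gap could be sampling noise is a two-proportion z-score. This is a rough sketch using an unpooled standard error. The pass counts in the usage line (25/30 and 79/99) are back-calculated from the reported 83.3% and 79.8%, so treat them as assumptions.

```python
import math


def gap_z_score(pass_a, n_a, pass_b, n_b):
    """Return (gap, standard_error, z) for the difference of two accuracies."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        raise ValueError("standard error is zero; z-score is undefined")
    gap = p_a - p_b
    return gap, se, gap / se


# Dev vs strict holdout for Grok 4.3 (counts inferred from the percentages).
gap, se, z = gap_z_score(25, 30, 79, 99)
print(f"gap={gap:.3f} se={se:.3f} z={z:.2f}")
```

As a rule of thumb, |z| well below 2 means the gap is hard to distinguish from noise at these sample sizes, while |z| above 2 is stronger evidence of a real drop.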
Bake-off cost note
Total methodology cost for 6-judge × N=30 = ~$3. Cheap enough to run a weekly bake-off when a vendor releases a new model.
Reusable framework
The whole stack is packaged inside a personal Eval-Framework. Adding a new judge = 1 file ~30 lines (provider client). Adding a new task = 1 YAML config.
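The framework itself is not shown here. Purely as an illustration of the plug-in pattern, a judge registry could look like the sketch below; every name in it (`JUDGES`, `register_judge`, `run_bakeoff`, the stub judge) is hypothetical, and a real provider client would replace the stub body with an API call.

```python
from typing import Callable, Dict, List

# Hypothetical registry: judge name -> function(system_prompt, user_message) -> raw text.
JUDGES: Dict[str, Callable[[str, str], str]] = {}


def register_judge(name: str):
    """Decorator that adds a judge function to the registry under `name`."""
    def decorator(fn: Callable[[str, str], str]):
        JUDGES[name] = fn
        return fn
    return decorator


@register_judge("echo-stub")
def echo_stub(system_prompt: str, user_message: str) -> str:
    # Stand-in for a provider client; always reports "no findings".
    return "[]"


def run_bakeoff(system_prompt: str, user_messages: List[str]) -> Dict[str, List[str]]:
    """Send the same prompt and messages to every registered judge."""
    return {
        name: [judge(system_prompt, message) for message in user_messages]
        for name, judge in JUDGES.items()
    }
```

Keeping the system prompt and message format outside the judge function is what keeps the comparison apples-to-apples: a new judge only supplies the transport.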
Enterprise application
Bake-off methodology = mandatory before any LLM vendor decision at scale:
- A bank choosing an LLM for compliance checks → task-specific bake-off
- An e-commerce chatbot vendor migration → eval intent accuracy across N=200 customer queries
- Healthcare summary generation → eval factuality across N=500 clinical notes
The methodology cost is trivial compared to vendor lock-in cost.