LLM bake-off across 6 judges: methodology + 7 surprising insights

TL;DR: I ran a bake-off across 6 LLMs on a single task (Knowledge-Audit cross-source contradiction detection). The winner was unexpected. Haiku, my default for lightweight tasks, finished last among the models that returned output (13% accuracy as a direct auditor). Grok 4.3 beat Opus 4.7. Methodology + insights below.

Methodology — 5 components

1. Stratified eval set — not just 30 random cases.

192 cases bootstrapped via Haiku, split as:

Bucket × Lang stratified (5 × 3 = 15 strata)
Dev (93) for prompt iteration
Holdout (99) FROZEN — only run once at final decision time

→ Holdout = anti-overfit. Bake-off on dev → confirm on holdout.
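A minimal sketch of a stratified dev/holdout split, to make the idea concrete. The field names `bucket` and `lang`, the 50/50 default, and the function name are assumptions for illustration, not the actual pipeline.

```python
import random
from collections import defaultdict

def stratified_split(cases, dev_fraction=0.5, seed=0):
    """Split cases into (dev, holdout) while preserving each (bucket, lang) stratum.

    `cases` is a list of dicts with hypothetical keys "bucket" and "lang".
    """
    rng = random.Random(seed)  # fixed seed so the holdout stays frozen across runs
    strata = defaultdict(list)
    for case in cases:
        strata[(case["bucket"], case["lang"])].append(case)

    dev, holdout = [], []
    for key in sorted(strata):  # sorted keys give a reproducible order
        group = strata[key][:]
        rng.shuffle(group)
        cut = round(len(group) * dev_fraction)
        dev.extend(group[:cut])
        holdout.extend(group[cut:])
    return dev, holdout
```

Splitting inside each stratum, rather than over the whole pool, keeps rare strata (for example a code-mixed bucket in one language) represented on both sides.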

2. Single task, multiple judges

Same SYSTEM_PROMPT_V3 (extract → compare → flag if different). Same user message format. Apples-to-apples.

Judge list:

Claude Haiku 4.5
Claude Sonnet 4.6
Claude Opus 4.7
xAI Grok 4-fast-reasoning
xAI Grok 4.3
Google Gemini Flash latest (3.5)
OpenAI GPT-5.4-mini

3. LLM-judge scorer (replaces strict substring match)

Each finding → Sonnet 4.6 verdict: VALID (real contradiction) or INVALID (paraphrase, wrong quote, over-inference).

Pass rule for positive cases: ≥1 finding VALID. Pass rule for negative cases: output = [].
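The pass rule above is small enough to write down directly. This is a sketch, assuming each finding has already been labelled "VALID" or "INVALID" by the scorer; the function name and argument shape are hypothetical.

```python
def case_passes(is_positive, verdicts):
    """Apply the pass rule to one case.

    is_positive: True if the case contains a real contradiction.
    verdicts: one "VALID"/"INVALID" label per finding the auditor emitted.
    """
    if is_positive:
        # Positive case: at least one finding must be judged VALID.
        return any(v == "VALID" for v in verdicts)
    # Negative case: the auditor must have emitted nothing at all.
    return len(verdicts) == 0
```

Note the asymmetry: on a negative case, any finding fails the case, even one the scorer marks INVALID.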

4. Bootstrap CI — 1000 resamples, 95% confidence interval.
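For reference, a percentile bootstrap over per-case pass/fail outcomes can be sketched as below. Which bootstrap variant was actually used is not stated here, so treat the percentile method, the seed, and the function name as assumptions.

```python
import random

def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    outcomes: list of 0/1 values, one per eval case (1 = pass).
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n cases with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With N=30 the interval is wide, which is worth keeping in mind when comparing judges that differ by only a few points.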

5. Pairwise Cohen’s κ — judge agreement (binary classifier outcome). Detects correlated bias.
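Cohen's κ compares observed agreement p_o with the agreement p_e expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). A minimal sketch, assuming each judge is represented by a per-case pass/fail vector over the same cases:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences (e.g. 0/1 per case)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Observed agreement: share of cases where both judges gave the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal frequencies, summed over labels.
    labels = set(a) | set(b)
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        # Both judges gave one identical constant label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A value near 1 means the two judges pass and fail the same cases, near 0 means agreement at chance level, and negative values mean they tend to fail different cases.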

Insight 1 — Haiku FAILS the audit task (13% accuracy)

I had hypothesized Haiku = good for “lightweight”. Reality: Haiku as a direct auditor severely under-detects.

Judge	Acc dev (N=30)	Acc holdout (N=99)
Haiku 4.5	13.3%	not run (abandoned)
Sonnet 4.6	60%	not reported
Grok 4.3	83.3%	99.0% (LLM-judge scoring)

Likely why: Haiku follows the prompt too conservatively. The “be conservative” rule in the system prompt → Haiku interprets it as “default to []” and misses real findings.

→ Haiku ≠ universal cheap option. Task-dependent. OK as a verifier, fails as a direct auditor.

Insight 2 — Grok 4.3 beats Opus 4.7

Judge	Accuracy	Provider
Grok 4.3	83%	xAI
Opus 4.7	67%	Anthropic flagship
Sonnet 4.6	60%	Anthropic

Opus = Anthropic’s most capable model. Grok 4.3 = xAI mid-tier. Grok beat Opus by +16pp on the audit task.

Hypothesis why: Opus is trained for broad reasoning and may over-think audit tasks (subtle vs explicit contradictions). Grok 4.3’s reasoning training is more “task-grounded”.

→ “Most capable” ≠ “best for your task”. Bake-off mandatory.

Insight 3 — Gemini fails entirely (99/99 cases errored)

Gemini Flash latest = newest model (3.5 per the modelVersion field). Returned an empty candidates array on 99/99 cases. I hypothesize either a safety filter or an output format incompatibility.

→ Newest model ≠ usable. Sanity-test the API integration before the bake-off.
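A pre-flight check along these lines would catch an integration that returns nothing before the full run is paid for. This is a sketch only: `call_model` is a hypothetical callable wrapping whatever provider client is in use, and the threshold is an assumption.

```python
def smoke_test(call_model, prompts, max_empty_rate=0.0):
    """Send a handful of prompts and measure how often the response is empty.

    call_model: hypothetical callable, prompt -> response text (or None).
    Returns (ok, empty_rate).
    """
    empty = 0
    for prompt in prompts:
        try:
            text = call_model(prompt)
        except Exception:
            text = None  # treat transport or API errors as empty responses
        if not text or not text.strip():
            empty += 1
    rate = empty / len(prompts)
    return rate <= max_empty_rate, rate
```

A legitimate "no findings" answer such as the string "[]" is non-empty text, so it is not confused with a response that carries no content at all.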

Insight 4 — Cohen’s kappa reveals correlated bias

Pair	κ	Note
Sonnet ↔ Opus	0.66	Same Anthropic family — correlated bias
Grok ↔ Sonnet	0.52	Independent providers
Grok ↔ GPT-5.4-mini	0.55	Independent providers
Haiku ↔ Grok	−0.40	Anti-correlated — opposite errors

→ For ensembles: avoid same-family pairs. Pick across providers.

Insight 5 — VN/EN/mixed bucket reveals language bias

Grok 4.3 holdout-99:

VN: 7/8 = 88%
EN: 5/8 = 63%
Mixed: 10/14 = 71%

→ Grok scored higher on VN than on EN for this task. Counter-intuitive, but with 8 cases per language the gap is only two cases, so treat it as a hint rather than a finding.
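To gauge how much weight per-language counts this small can bear, a Wilson score interval is a reasonable quick check. The sketch below is an addition for illustration and not part of the original bake-off; the function name is hypothetical and z=1.96 assumes a 95% interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Per-language counts from the list above.
for lang, passed, total in [("VN", 7, 8), ("EN", 5, 8), ("Mixed", 10, 14)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{lang}: {passed}/{total} -> 95% CI [{lo:.2f}, {hi:.2f}]")
```

The VN and EN intervals overlap heavily, so the per-language ranking needs more cases per stratum before it can be relied on.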

Insight 6 — Code-mixed bucket = hardest

The real corpus has 28% of sentences mixing VN diacritics + EN tech terms (e.g. “em verify chưa được vì RAM 64GB không đủ load 235B MoE”). The “code_mixed” bucket = lowest accuracy across ALL judges.

→ Eval sets MUST include code-mixed cases. Pure VN/EN cases overestimate accuracy.

Insight 7 — Holdout reveals overfit

Grok 4.3 dev N=30 = 83.3%. Holdout N=99 = 79.8% (strict) / 99.0% (LLM-judge scoring). Strict scoring on holdout dropped about 3.5pp vs dev → mild overfit (the dev set is only N=30, so read the size of the gap as indicative).

→ Single-set eval is misleading. Always have a frozen holdout.

Bake-off cost note

Total methodology cost for 6-judge × N=30 = ~$3. Cheap enough to run a weekly bake-off when a vendor releases a new model.

Reusable framework

The whole stack is packaged inside a personal Eval-Framework. Adding a new judge = 1 file ~30 lines (provider client). Adding a new task = 1 YAML config.
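To illustrate what a "one file per judge" design could look like, here is a hypothetical interface with an offline stub. Every name below (`Judge`, `audit`, `AlwaysEmptyJudge`, `run_judge`, the `id` and `text` keys) is an assumption made for this sketch and does not describe the actual Eval-Framework.

```python
from typing import Protocol

class Judge(Protocol):
    """Hypothetical contract each provider client would implement."""
    name: str

    def audit(self, system_prompt: str, user_message: str) -> list[str]:
        """Return the list of findings for one case ([] means no contradiction)."""
        ...

class AlwaysEmptyJudge:
    """Offline stub that never reports a finding; useful as a baseline."""
    name = "always-empty"

    def audit(self, system_prompt: str, user_message: str) -> list[str]:
        return []

def run_judge(judge: Judge, system_prompt: str, cases: list[dict]) -> dict:
    """Run one judge over all cases with the same prompt, keyed by case id."""
    return {c["id"]: judge.audit(system_prompt, c["text"]) for c in cases}
```

Keeping the prompt and message format outside the judge class is what makes the comparison apples-to-apples: a new provider only supplies the transport.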

Enterprise application

Bake-off methodology = mandatory before any LLM vendor decision at scale:

A bank choosing an LLM for compliance checks → task-specific bake-off
An e-commerce chatbot vendor migration → eval intent accuracy across N=200 customer queries
Healthcare summary generation → eval factuality across N=500 clinical notes

The methodology cost is trivial compared to vendor lock-in cost.