LLM bake-off across 6 judges: methodology + 7 surprising insights

Tested Claude (Haiku/Sonnet/Opus) vs Grok 4.3 vs GPT-5.4-mini vs Gemini Flash. N=30 dev + N=99 holdout. Stratified eval. Pairwise Cohen's kappa. Surprising winner.

TL;DR: I ran a bake-off across 6 LLMs on a single task (Knowledge-Audit cross-source contradiction detection). The results surprised me. Haiku, my default for lightweight tasks, finished last (13% accuracy as a direct auditor). Grok 4.3 beat Opus 4.7 (83% vs 67% on dev). Methodology + insights below.

Methodology — 5 components

1. Stratified eval set — not just 30 random cases.

192 cases bootstrapped via Haiku, split as:

  • Bucket × Lang stratified (5 × 3 = 15 strata)
  • Dev (93) for prompt iteration
  • Holdout (99) FROZEN — only run once at final decision time

→ Holdout = anti-overfit. Bake-off on dev → confirm on holdout.
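A minimal sketch of how a Bucket × Lang stratified dev/holdout split could be done. The field names `bucket` and `lang`, the function name, and the 50/50 fraction are assumptions for illustration; the post does not show the actual split code.

```python
import random
from collections import defaultdict


def stratified_split(cases, holdout_fraction=0.5, seed=0):
    """Split cases into (dev, holdout), sampling inside each (bucket, lang) stratum.

    `cases` is a list of dicts with hypothetical keys "bucket" and "lang".
    A fixed seed keeps the holdout reproducible, so it can stay frozen.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[(case["bucket"], case["lang"])].append(case)

    dev, holdout = [], []
    for key in sorted(strata):  # sorted so the result does not depend on input order of strata
        members = strata[key][:]
        rng.shuffle(members)
        n_holdout = round(len(members) * holdout_fraction)
        holdout.extend(members[:n_holdout])
        dev.extend(members[n_holdout:])
    return dev, holdout
```

Splitting inside each stratum, rather than shuffling the whole pool, keeps every Bucket × Lang combination represented on both sides.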

2. Single task, multiple judges

Same SYSTEM_PROMPT_V3 (extract → compare → flag if different). Same user message format. Apples-to-apples.

Judge list:

  • Claude Haiku 4.5
  • Claude Sonnet 4.6
  • Claude Opus 4.7
  • xAI Grok 4-fast-reasoning
  • xAI Grok 4.3
  • Google Gemini Flash latest (3.5)
  • OpenAI GPT-5.4-mini

3. LLM-judge scorer (replaces strict substring match)

Each finding → Sonnet 4.6 verdict: VALID (real contradiction) or INVALID (paraphrase, wrong quote, over-inference).

Pass rule for positive cases: ≥1 finding VALID. Pass rule for negative cases: output = [].
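The pass rule above can be sketched as a small function. The name `case_passes` and the representation of verdicts as a list of strings are assumptions; only the VALID/INVALID labels and the two rules come from the text.

```python
def case_passes(is_positive, verdicts):
    """Apply the pass rule to one eval case.

    is_positive: True if the case contains a real contradiction.
    verdicts: one "VALID" or "INVALID" string per finding the judge emitted
              (an empty list means the judge output was []).
    """
    if is_positive:
        # Positive case: at least one finding must be judged VALID.
        return any(v == "VALID" for v in verdicts)
    # Negative case: the judge must emit no findings at all.
    return len(verdicts) == 0
```

Note that under this rule an empty output on a positive case is a failure, so a judge that defaults to [] is penalized on every positive case.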

4. Bootstrap CI — 1000 resamples, 95% confidence interval.
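A hedged sketch of a bootstrap confidence interval over per-case pass/fail outcomes. It uses the percentile method, which is an assumption: the post only states 1000 resamples and a 95% interval. The function name is hypothetical.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for accuracy.

    outcomes: list of 0/1 values, one per eval case (1 = pass).
    Returns (lower, upper) bounds on the accuracy.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample cases with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lo_idx = max(0, round(alpha * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Example: 25 passes out of 30 cases (83.3% accuracy).
lower, upper = bootstrap_ci([1] * 25 + [0] * 5)
```

At N=30 the interval is wide, which is a reason to compare judges by interval and not just by point accuracy.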

5. Pairwise Cohen’s κ — judge agreement (binary classifier outcome). Detects correlated bias.
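For reference, Cohen's κ on binary per-case outcomes can be computed as below. This is a generic sketch of the standard formula, κ = (p_o − p_e) / (1 − p_e), and not necessarily the code used for the numbers in this post; the function name is an assumption.

```python
import math


def cohens_kappa(a, b):
    """Cohen's kappa for two judges' binary outcomes on the same cases.

    a, b: equal-length lists of 0/1 values (e.g. 1 = case passed).
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of cases where both judges match.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each judge's marginal pass rate.
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1:
        # Both judges are constant and identical; kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1 - p_expected)
```

κ near 1 means the two judges pass and fail the same cases, κ near 0 means agreement is no better than chance, and negative κ means they tend to disagree.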

Insight 1 — Haiku FAILS the audit task (13% accuracy)

I had hypothesized Haiku = good for “lightweight”. Reality: Haiku as a direct auditor severely under-detects.

Judge | Acc (dev, N=30) | Acc (holdout, N=99)
Haiku 4.5 | 13.3% | not run (abandoned)
Sonnet 4.6 | 60% | -
Grok 4.3 | 83.3% | 99.0% (LLM-judge)

Why: Haiku has a conservative prompt-following bias. The “be conservative” rule in the system prompt → Haiku interprets it as “default to []”. Misses real findings.

Haiku ≠ universal cheap option. Task-dependent. OK as a verifier, fails as a direct auditor.

Insight 2 — Grok 4.3 beats Opus 4.7

Judge | Accuracy | Provider
Grok 4.3 | 83% | xAI
Opus 4.7 | 67% | Anthropic flagship
Sonnet 4.6 | 60% | -

Opus = Anthropic’s most capable model. Grok 4.3 = xAI mid-tier. Grok beat Opus by +16pp on the audit task.

Hypothesis why: Opus is trained for broad reasoning and may over-think audit tasks (subtle vs explicit contradictions). Grok 4.3’s reasoning training is more “task-grounded”.

→ “Most capable” ≠ “best for your task”. Bake-off mandatory.

Insight 3 — Gemini fails entirely (100% error rate: 99/99 cases)

Gemini Flash latest = newest model (3.5 per the modelVersion field). Returned an empty candidates array on 99/99 cases. I hypothesize either a safety filter or an output format incompatibility.

→ Newest model ≠ usable. Sanity-test the API integration before the bake-off.
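A sketch of the kind of sanity test meant here: run a handful of prompts through the integration and fail fast if responses come back empty. `call_judge` is a hypothetical callable wrapping whichever provider client is in use (it takes a prompt and returns the response text); no real provider API is shown.

```python
def smoke_test(call_judge, prompts, max_empty_fraction=0.0):
    """Return (ok, empty_fraction) for a provider integration.

    call_judge: hypothetical function prompt -> response text (or None).
    A response counts as empty if the call raises, returns None,
    or returns only whitespace.
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    empty = 0
    for prompt in prompts:
        try:
            text = call_judge(prompt)
        except Exception:
            text = None  # treat any client error as an empty response
        if not text or not text.strip():
            empty += 1
    empty_fraction = empty / len(prompts)
    return empty_fraction <= max_empty_fraction, empty_fraction
```

Running this on three or four cases per judge before the full run would surface an all-empty integration for the price of a few requests.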

Insight 4 — Cohen’s kappa reveals correlated bias

Pair | κ | Note
Sonnet ↔ Opus | 0.66 | Same Anthropic family, correlated bias
Grok ↔ Sonnet | 0.52 | Independent providers
Grok ↔ OpenAI | 0.55 | Independent
Haiku ↔ Grok | −0.40 | Anti-correlated, opposite errors

→ For ensembles: avoid same-family pairs. Pick across providers.

Insight 5 — VN/EN/mixed bucket reveals language bias

Grok 4.3 holdout-99:

  • VN: 7/8 = 88%
  • EN: 5/8 = 63%
  • Mixed: 10/14 = 71%

→ Grok is unexpectedly strong on VN, weak on EN for this task. Counter-intuitive.
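The per-language breakdown is a simple group-by over per-case results. A minimal sketch, assuming each result is a dict with hypothetical keys `lang` and `passed`; the example data reproduces the counts listed above.

```python
from collections import defaultdict


def accuracy_by_group(results, key="lang"):
    """Return {group: (passed, total, accuracy)} for per-case results."""
    tally = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for r in results:
        tally[r[key]][0] += 1 if r["passed"] else 0
        tally[r[key]][1] += 1
    return {g: (p, t, p / t) for g, (p, t) in tally.items()}


# Rebuild the counts from the list above: VN 7/8, EN 5/8, Mixed 10/14.
counts = {"VN": (7, 8), "EN": (5, 8), "Mixed": (10, 14)}
results = [
    {"lang": lang, "passed": i < passed}
    for lang, (passed, total) in counts.items()
    for i in range(total)
]
by_lang = accuracy_by_group(results)
```

With only 8 cases in a bucket, a single case moves accuracy by 12.5 points, so per-language gaps at this sample size are best read as hints to investigate.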

Insight 6 — Code-mixed bucket = hardest

The real corpus has 28% of sentences mixing VN diacritics + EN tech terms (e.g. “em verify chưa được vì RAM 64GB không đủ load 235B MoE”). The “code_mixed” bucket = lowest accuracy across ALL judges.

→ Eval sets MUST include code-mixed cases. Pure VN/EN cases overestimate accuracy.

Insight 7 — Holdout reveals overfit

Grok 4.3 dev N=30 = 83.3%. Holdout N=99 = 79.8% (strict) / 99.0% (LLM-judge scoring). Strict scoring on holdout dropped about 3.5pp vs dev (83.3% → 79.8%), which is consistent with mild overfit to the dev set.

→ Single-set eval is misleading. Always have a frozen holdout.

Bake-off cost note

Total methodology cost for 6-judge × N=30 = ~$3. Cheap enough to run a weekly bake-off when a vendor releases a new model.

Reusable framework

The whole stack is packaged inside a personal Eval-Framework. Adding a new judge = 1 file ~30 lines (provider client). Adding a new task = 1 YAML config.
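To illustrate what "one file per judge" could look like, here is a hypothetical interface sketch. The `Judge` protocol, `audit` method, and `run_bakeoff` helper are invented for illustration and are not the actual Eval-Framework code; a real judge file would wrap a provider client inside `audit`.

```python
from typing import Dict, List, Protocol


class Judge(Protocol):
    """Hypothetical interface every judge file would implement."""

    name: str

    def audit(self, system_prompt: str, user_message: str) -> List[str]:
        """Return the list of findings for one case ([] = no contradiction)."""
        ...


class AlwaysEmptyJudge:
    """Stand-in judge for testing the harness; it never reports a finding."""

    name = "always-empty"

    def audit(self, system_prompt: str, user_message: str) -> List[str]:
        return []


def run_bakeoff(
    judges: List[Judge], cases: List[str], system_prompt: str
) -> Dict[str, List[List[str]]]:
    """Run every judge on every case with the same system prompt."""
    return {
        judge.name: [judge.audit(system_prompt, case) for case in cases]
        for judge in judges
    }
```

Keeping the prompt and case format outside the judge class is what makes the comparison apples-to-apples: only the provider call differs per file.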

Enterprise application

Bake-off methodology = mandatory before any LLM vendor decision at scale:

  • A bank choosing an LLM for compliance checks → task-specific bake-off
  • An e-commerce chatbot vendor migration → eval intent accuracy across N=200 customer queries
  • Healthcare summary generation → eval factuality across N=500 clinical notes

The methodology cost is trivial compared to vendor lock-in cost.