Eval-Framework — LLM Bake-off & Production Judge

A production-grade personal eval framework for every AI feature in the portfolio. Converts three recurring PM questions — should we migrate from model A to model B?, which prompts break after the swap?, are local LLMs ready to replace cloud APIs? — from “2-3 weeks of guessing against public benchmarks” into “evidence-based answers in a few hours” on a task-specific eval set.

At a glance

6-model bake-off shipped — Claude Haiku 4.5, Sonnet 4.6, Opus 4.7, Grok 4-fast-reasoning, Grok 4.3, GPT-5.4-mini, Gemini 2.5 Flash
Stratified eval set: 192 cases bootstrapped via Haiku — 5 buckets × 3 languages (15 strata) — split 93 dev + 99 holdout (FROZEN)
Scoring breakthrough: strict substring scoring = 79.8%, LLM-judged scoring = 99.0% true accuracy on holdout-99 — same outputs, same judge model, different scorer
Production judge picked: Grok 4.3 wins Knowledge-Audit at 99% — beating Opus 4.7 (67%) and Haiku as direct auditor (last at 13%)
Production cost: $0.61/month for Grok 4.3 at audit volume (15K audits/mo SMB-scale also lands at ~$90/mo all-in)
Foundation for 8 personal projects — Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework itself
Methodology = 5 components — stratified eval, single task × multi judges, LLM-as-judge for multi-valid-output tasks, pairwise Cohen’s κ for correlated bias detection, frozen holdout protection
Shipped F1+F2+F3 in a 10-hour day (2026-05-21) — single-day sprint from CLI scaffold to verified production decision

Stack

Python 3.11 · pytest (test runner + parametrize for eval cases) · Pydantic (task config schema) · Anthropic SDK · xAI SDK (Grok) · OpenAI SDK (GPT-5.4-mini) · Google Gemini SDK · Postgres 16 (eval-run audit log) · scikit-learn (Cohen’s kappa pairwise judge agreement) · numpy (bootstrap CI 1000 resamples)

Documentation

Doc	Read this for
PRD	What & why — problem framing, JTBD, scope, milestones, success metrics
Architecture	System diagrams, eval pipeline, judge adapter pattern, scoring layer
Implementation	Tech stack, code structure, schema, perf, reproducibility steps
Notes	Chronological decision log + gotchas + the 7 PM-bias catches
Enterprise	5 enterprise use cases — B2B SaaS chatbot eval, fintech reconciliation, EdTech moderation, healthcare triage, multi-tenant LLM migration

Quickstart for users

# 1. Bake-off across N judges on a task
eval-framework bake-off \
  --task knowledge_audit \
  --judges grok-4.3,claude-haiku-4.5,claude-sonnet-4.6,gpt-5.4-mini \
  --eval-set holdout-99 \
  --scorer llm_judge

# 2. Output: per-judge accuracy + bootstrap CI 95% + Cohen's κ matrix +
#    cost/case + p95 latency + per-stratum breakdown

Adding a new judge = 1 file (~30 LOC provider client). Adding a new task = 1 YAML config (system prompt + scoring rubric + eval set path).

Project status

Phase	Milestone
F1	CLI scaffold + task config schema + Anthropic + xAI adapters
F2	OpenAI + Gemini adapters; LLM-judge scorer; bootstrap CI
F3	Stratified 192-case eval set; dev/holdout split; pairwise κ; first verified production pick (Grok 4.3 for Knowledge-Audit)
F4	Apply to Personal-RAG (retrieval quality eval) — DONE 2026-05-21, Hit@3=97.8% / MRR=0.948
F5	Mail-Assistant inbox classification eval — queued
F6	Voice-Assistant tool-use accuracy eval — queued
F7	4-tier triage routing (daily=4B+Haiku, weekly=8B+Haiku, monthly=32B, on-demand=Grok 4.3) — design done, distill blocked (4 Metal crashes, retrain weekend)

Total build time: F1+F2+F3 shipped in ~10 hours (single day, 2026-05-21). F4 = +4 h.

Foundation for downstream projects

This framework is shared infrastructure for every LLM-powered side project. Each project gets:

A reproducible eval set (frozen holdout)
A bake-off command to pick the right model on its real task
A scoring rubric (LLM-judge by default) that survives prompt iteration
A production cost number (per-case + per-month) before commit
A weekly drift-monitoring cron with Telegram alerts (≥5pp regression → P0)

Every model decision in the portfolio runs through this framework. No “default to Haiku because I know the SDK” choices anymore.