Shipped · P0 · Size M · Foundation

Eval-Framework — LLM Bake-off & Production Judge

A personal eval harness that turns LLM model + prompt + scorer decisions from gut-feel guessing into evidence-based answers via stratified eval sets, multi-provider bake-offs, and LLM-as-judge scoring.

This is a production-grade personal eval framework for every AI feature in the portfolio. It converts three recurring PM questions (should we migrate from model A to model B? which prompts break after the swap? are local LLMs ready to replace cloud APIs?) from “2-3 weeks of guessing against public benchmarks” into “evidence-based answers in a few hours” on a task-specific eval set.

At a glance

  • 7-model bake-off shipped: Claude Haiku 4.5, Sonnet 4.6, Opus 4.7, Grok 4-fast-reasoning, Grok 4.3, GPT-5.4-mini, Gemini 2.5 Flash
  • Stratified eval set: 192 cases bootstrapped via Haiku — 5 buckets × 3 languages (15 strata) — split 93 dev + 99 holdout (FROZEN)
  • Scoring breakthrough: strict substring scoring = 79.8%, LLM-judged scoring = 99.0% true accuracy on holdout-99 — same outputs, same judge model, different scorer
  • Production judge picked: Grok 4.3 wins Knowledge-Audit at 99% — beating Opus 4.7 (67%) and Haiku as direct auditor (last at 13%)
  • Production cost: $0.61/month for Grok 4.3 at audit volume; an SMB-scale load of 15K audits/month lands at ~$90/month all-in
  • Foundation for 8 personal projects — Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework itself
  • Methodology = 5 components — stratified eval, single task × multi judges, LLM-as-judge for multi-valid-output tasks, pairwise Cohen’s κ for correlated bias detection, frozen holdout protection
  • Shipped F1+F2+F3 in a 10-hour day (2026-05-21) — single-day sprint from CLI scaffold to verified production decision
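
The stratified dev/holdout split listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the case fields (`id`, `bucket`, `language`), the bucket and language names, and the 50/50 fraction are all assumptions. The real set is split 93/99 rather than exactly in half.

```python
import random
from collections import defaultdict


def stratified_split(cases, holdout_fraction=0.5, seed=0):
    """Split cases into (dev, holdout), sampling each (bucket, language) stratum independently."""
    strata = defaultdict(list)
    for case in cases:
        strata[(case["bucket"], case["language"])].append(case)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    dev, holdout = [], []
    for key in sorted(strata):  # sorted keys give a deterministic iteration order
        members = list(strata[key])
        rng.shuffle(members)
        n_holdout = round(len(members) * holdout_fraction)
        holdout.extend(members[:n_holdout])
        dev.extend(members[n_holdout:])
    return dev, holdout


# Hypothetical toy data: 5 buckets x 3 languages, 4 cases per stratum (60 cases).
BUCKETS = ["bucket_1", "bucket_2", "bucket_3", "bucket_4", "bucket_5"]
LANGUAGES = ["lang_a", "lang_b", "lang_c"]
cases = [
    {"id": f"{bucket}-{language}-{i}", "bucket": bucket, "language": language}
    for bucket in BUCKETS
    for language in LANGUAGES
    for i in range(4)
]

dev, holdout = stratified_split(cases, seed=42)
print(len(dev), len(holdout))
```

Splitting inside each stratum, rather than shuffling the whole set once, keeps every bucket and language represented on both sides. Freezing the holdout then amounts to never re-running the split with a different seed.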

Stack

Python 3.11 · pytest (test runner + parametrize for eval cases) · Pydantic (task config schema) · Anthropic SDK · xAI SDK (Grok) · OpenAI SDK (GPT-5.4-mini) · Google Gemini SDK · Postgres 16 (eval-run audit log) · scikit-learn (pairwise Cohen’s kappa for judge agreement) · numpy (bootstrap CI, 1,000 resamples)
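
As an illustration of the numpy bootstrap step, here is a minimal percentile-bootstrap sketch for an accuracy estimate. The function name, the percentile method, and the seed are assumptions. The toy outcomes (79 correct out of 99) are chosen only so that the point estimate matches the 79.8% figure quoted earlier.

```python
import numpy as np


def bootstrap_accuracy_ci(correct, n_resamples=1000, confidence=0.95, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap over per-case outcomes."""
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample: indices drawn with replacement from the cases.
    idx = rng.integers(0, len(correct), size=(n_resamples, len(correct)))
    resampled_means = correct[idx].mean(axis=1)
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(resampled_means, [alpha, 1.0 - alpha])
    return float(correct.mean()), float(lower), float(upper)


# Hypothetical per-case outcomes: 1 = scored correct, 0 = scored incorrect.
outcomes = [1] * 79 + [0] * 20
accuracy, ci_low, ci_high = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={accuracy:.3f}  95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

With only 99 holdout cases the interval is wide, which is why a bake-off is easier to interpret when the interval is reported next to the point estimate.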

Documentation

Doc | Read this for
PRD | What & why: problem framing, JTBD, scope, milestones, success metrics
Architecture | System diagrams, eval pipeline, judge adapter pattern, scoring layer
Implementation | Tech stack, code structure, schema, perf, reproducibility steps
Notes | Chronological decision log + gotchas + the 7 PM-bias catches
Enterprise | 5 enterprise use cases: B2B SaaS chatbot eval, fintech reconciliation, EdTech moderation, healthcare triage, multi-tenant LLM migration

Quickstart for users

# 1. Bake-off across N judges on a task
eval-framework bake-off \
  --task knowledge_audit \
  --judges grok-4.3,claude-haiku-4.5,claude-sonnet-4.6,gpt-5.4-mini \
  --eval-set holdout-99 \
  --scorer llm_judge

# 2. Output: per-judge accuracy + bootstrap CI 95% + Cohen's κ matrix +
#    cost/case + p95 latency + per-stratum breakdown
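
The Cohen's κ matrix in that output can be sketched with scikit-learn's `cohen_kappa_score`. The judge names and verdict labels below are made-up toy data, and `pairwise_kappa` is a hypothetical helper rather than a function from the framework.

```python
from itertools import combinations

from sklearn.metrics import cohen_kappa_score


def pairwise_kappa(verdicts):
    """Map each unordered judge pair to Cohen's kappa over the same ordered list of cases."""
    return {
        (a, b): float(cohen_kappa_score(verdicts[a], verdicts[b]))
        for a, b in combinations(sorted(verdicts), 2)
    }


# Hypothetical verdicts from three judges on the same six cases.
verdicts = {
    "judge_a": ["pass", "pass", "fail", "pass", "fail", "pass"],
    "judge_b": ["pass", "pass", "fail", "pass", "fail", "pass"],
    "judge_c": ["pass", "fail", "fail", "pass", "pass", "pass"],
}

kappa = pairwise_kappa(verdicts)
for pair, value in kappa.items():
    print(pair, round(value, 3))
```

Kappa corrects raw agreement for the agreement expected by chance. A pair of judges that agree strongly with each other while both scoring poorly against the reference labels is a hint of correlated bias.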

Adding a new judge = 1 file (~30 LOC provider client). Adding a new task = 1 YAML config (system prompt + scoring rubric + eval set path).
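
A rough sketch of what a one-file judge adapter could look like, using only the standard library. Every name here (`Verdict`, `Judge`, `JUDGES`, `register`, `KeywordStubJudge`, `run_bake_off`) is hypothetical and does not come from the framework. The stub judge stands in for a real provider client so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Verdict:
    label: str        # e.g. "pass" or "fail"
    cost_usd: float   # per-call cost reported by the adapter
    latency_s: float  # wall-clock latency of the call


class Judge(Protocol):
    """Interface every provider adapter is assumed to implement."""

    name: str

    def judge(self, system_prompt: str, case_input: str) -> Verdict: ...


JUDGES: dict[str, Judge] = {}


def register(judge: Judge) -> Judge:
    """Add an adapter to the registry under its name."""
    JUDGES[judge.name] = judge
    return judge


class KeywordStubJudge:
    """Offline stand-in; a real adapter would call a provider SDK here."""

    name = "keyword-stub"

    def judge(self, system_prompt: str, case_input: str) -> Verdict:
        label = "fail" if "contradiction" in case_input.lower() else "pass"
        return Verdict(label=label, cost_usd=0.0, latency_s=0.0)


register(KeywordStubJudge())


def run_bake_off(judge_names, system_prompt, cases):
    """Run every named judge over every case and collect the verdict labels."""
    return {
        name: [JUDGES[name].judge(system_prompt, case).label for case in cases]
        for name in judge_names
    }


results = run_bake_off(
    ["keyword-stub"],
    "Audit the note for factual problems.",
    ["All claims are sourced.", "This note contains a contradiction."],
)
print(results)
```

Under this design the bake-off loop depends only on the `Judge` interface, so adding a provider touches one file and no orchestration code.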

Project status

Phase | Milestone
F1 | CLI scaffold + task config schema + Anthropic + xAI adapters
F2 | OpenAI + Gemini adapters; LLM-judge scorer; bootstrap CI
F3 | Stratified 192-case eval set; dev/holdout split; pairwise κ; first verified production pick (Grok 4.3 for Knowledge-Audit)
F4 | Apply to Personal-RAG (retrieval quality eval): DONE 2026-05-21, Hit@3=97.8% / MRR=0.948
F5 | Mail-Assistant inbox classification eval: queued
F6 | Voice-Assistant tool-use accuracy eval: queued
F7 | 4-tier triage routing (daily=4B+Haiku, weekly=8B+Haiku, monthly=32B, on-demand=Grok 4.3): design done, distill blocked (4 Metal crashes, retrain weekend)
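
For reference, the two retrieval metrics reported for F4 can be computed as sketched below. The sketch assumes exactly one relevant document per query, which may not match the Personal-RAG eval set. The document IDs and rankings are toy data.

```python
def hit_at_k(ranked_ids, relevant_id, k=3):
    """True if the relevant document appears in the top-k results."""
    return relevant_id in ranked_ids[:k]


def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(queries, k=3):
    """Return (Hit@k, MRR) averaged over (ranked_ids, relevant_id) pairs."""
    hits = [hit_at_k(ranked, gold, k) for ranked, gold in queries]
    rrs = [reciprocal_rank(ranked, gold) for ranked, gold in queries]
    return sum(hits) / len(queries), sum(rrs) / len(queries)


# Hypothetical rankings: (retrieved IDs in rank order, the single relevant ID).
queries = [
    (["d1", "d2", "d3", "d4"], "d1"),      # found at rank 1
    (["d5", "d6", "d7", "d8"], "d6"),      # found at rank 2
    (["d9", "d10", "d11", "d12"], "d12"),  # found at rank 4, outside the top 3
    (["d13", "d14", "d15", "d16"], "d99"), # never retrieved
]

hit_at_3, mrr = evaluate_retrieval(queries, k=3)
print(f"Hit@3={hit_at_3:.3f}  MRR={mrr:.4f}")
```

Hit@3 only asks whether the right document made the top three, while MRR also rewards placing it higher, so the two can diverge when correct documents cluster at rank 2 or 3.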

Total build time: F1+F2+F3 shipped in ~10 hours (single day, 2026-05-21). F4 = +4 h.

Foundation for downstream projects

This framework is shared infrastructure for every LLM-powered side project. Each project gets:

  • A reproducible eval set (frozen holdout)
  • A bake-off command to pick the right model on its real task
  • A scoring rubric (LLM-judge by default) that survives prompt iteration
  • A production cost number (per-case + per-month) before commit
  • A weekly drift-monitoring cron with Telegram alerts (≥5pp regression → P0)
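
The regression rule in the last bullet could be expressed as a small check like the one below. The function name, the `"ok"` label, and the use of percentages as inputs are assumptions. Cron scheduling and Telegram delivery are left out.

```python
def classify_drift(baseline_pct, current_pct, threshold_pp=5.0):
    """Compare two accuracies given in percent; a drop of threshold_pp or more is a P0."""
    drop_pp = round(baseline_pct - current_pct, 6)  # rounded to suppress float noise
    severity = "P0" if drop_pp >= threshold_pp else "ok"
    return severity, drop_pp


# Hypothetical weekly readings against a 99.0% baseline.
print(classify_drift(99.0, 94.0))  # a 5.0 pp drop reaches the threshold
print(classify_drift(99.0, 96.5))  # a 2.5 pp drop stays below it
```

Working in percentage points rather than relative change keeps the threshold independent of how high the baseline is.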

Every model decision in the portfolio runs through this framework. No “default to Haiku because I know the SDK” choices anymore.
