Implementation
Sister docs: PRD (intent), Architecture (system view), Notes (decision log).
TL;DR
A production-grade personal eval framework in continuous use across the side-project portfolio:
- 7-model bake-off shipped (Claude Haiku 4.5, Sonnet 4.6, Opus 4.7, Grok 4-fast-reasoning, Grok 4.3, GPT-5.4-mini, Gemini 2.5 Flash)
- Stratified 192-case eval set (5 buckets × 3 languages); 93 dev + 99 holdout (FROZEN)
- LLM-as-judge scorer (Grok 4.3 default) — replaces strict substring on multi-valid-output tasks; verified +19.2pp accuracy gap on holdout-99
- Production judge picked: Grok 4.3 wins Knowledge-Audit at 99.0% on holdout-99, $0.61/mo at audit volume
- Statistics: bootstrap CI 1000 resamples (95%) + pairwise Cohen’s κ for correlated-bias detection
- CLI: eval-framework bake-off | score | verify, plus a Python API for notebooks
- F1+F2+F3 shipped in a 10-hour single-day sprint (2026-05-21)
- Applied to 2 projects so far: Knowledge-Audit (production) + Personal-RAG (Hit@3=97.8%, MRR=0.948 verified 2026-05-21); 6 more queued
- Drift monitor: weekly launchd cron + Telegram P0 alert on ≥5pp regression
- Audit log: Postgres eval_runs table shared with Personal-RAG infra
Stack
| Layer | Component | Version / Notes |
|---|---|---|
| Runtime | Python | 3.11 + venv |
| CLI | click | sub-commands: bake-off, score, verify, inspect |
| Config schema | Pydantic | v2; YAML loader via pyyaml |
| Async runtime | asyncio + anyio | semaphore-bounded concurrency=8 |
| Provider SDKs | anthropic · xai-sdk · openai · google-genai | official clients per provider |
| Statistics | numpy + scikit-learn | bootstrap resample; cohen_kappa_score |
| Audit log | Postgres 16 | shared with Personal-RAG; table eval_runs |
| Scheduler | launchd | ai.eval-framework.drift-check.plist (weekly Sat 03:00) |
| Alerting | Telegram Bot API | eval-framework-bot |
| Test runner | pytest | parametrized over eval cases for ad-hoc inspection |
Directory layout
Repo (~/Documents/Side.Projects/eval-framework/)
src/eval_framework/
├── cli.py # click entrypoints
├── runner.py # asyncio fan-out + retry + holdout guard
├── config.py # Pydantic schemas (TaskConfig, EvalCase, ...)
├── adapters/
│ ├── base.py # JudgeAdapter protocol + CompletionResult
│ ├── anthropic_adapter.py # ~35 LOC
│ ├── xai_adapter.py # ~30 LOC
│ ├── openai_adapter.py # ~30 LOC
│ ├── gemini_adapter.py # ~40 LOC (handles empty candidates)
│ └── mlx_adapter.py # config-shaped stub, deferred wire-up
├── scoring/
│ ├── strict.py # substring match
│ ├── llm_judge.py # judge call + verdict cache
│ └── rubrics.py # built-in rubric templates
├── stats/
│ ├── bootstrap.py # 1000-resample CI
│ ├── kappa.py # pairwise Cohen's κ matrix
│ └── stratify.py # per-(bucket, lang) aggregation
├── reporter.py # rich-table render + Postgres write
├── prices.py # static $/Mtok table per (provider, model)
└── drift.py # weekly verify cron entry
tasks/
├── knowledge_audit.yaml # production task
├── personal_rag_retrieval.yaml # F4 application
├── mail_classify.yaml # queued
└── voice_tool_use.yaml # queued
eval/
├── knowledge_audit/
│ ├── dev-93.yaml # iterable
│ ├── holdout-99.yaml # FROZEN
│ ├── holdout-30-sample.yaml # weekly drift subset
│ └── holdout-runs.log # tamper-evident
└── personal_rag/
└── personal-93.yaml # 93-query held-out personal eval
schema/
└── eval_runs.sql
launchd/
└── ai.eval-framework.drift-check.plist
Schema
CREATE TABLE eval_runs (
id BIGSERIAL PRIMARY KEY,
task_name TEXT NOT NULL,
eval_set_name TEXT NOT NULL, -- 'dev-93' | 'holdout-99' | ...
judge_model TEXT NOT NULL, -- 'grok-4.3' | 'claude-haiku-4.5' | ...
scorer_mode TEXT NOT NULL, -- 'llm_judge' | 'strict_substring'
judge_for_scoring TEXT, -- 'claude-sonnet-4-6' | 'grok-4.3' | NULL
n_cases INT NOT NULL,
n_pass INT NOT NULL,
accuracy NUMERIC(5,4) NOT NULL, -- 0.0000 - 1.0000
ci_low NUMERIC(5,4),
ci_high NUMERIC(5,4),
p95_latency_ms INT,
total_cost_usd NUMERIC(8,4),
per_stratum JSONB, -- {"positive_explicit_VN": 0.88, ...}
run_type TEXT NOT NULL, -- 'bake-off' | 'final-decision' | 'drift_check'
git_sha TEXT,
started_at TIMESTAMPTZ NOT NULL,
finished_at TIMESTAMPTZ NOT NULL,
notes TEXT
);
CREATE INDEX idx_eval_runs_task ON eval_runs(task_name, finished_at DESC);
CREATE INDEX idx_eval_runs_drift ON eval_runs(task_name, judge_model, run_type, finished_at DESC);
CREATE TABLE eval_kappa (
eval_run_id_a BIGINT REFERENCES eval_runs(id),
eval_run_id_b BIGINT REFERENCES eval_runs(id),
kappa NUMERIC(5,4) NOT NULL,
PRIMARY KEY (eval_run_id_a, eval_run_id_b)
);
Why this shape:
- One row per (judge × eval-set × run) = grain matches the reporter table
- per_stratum JSONB → flexible without per-task schema churn
- (task_name, judge_model, run_type) index → drift cron queries last-4-week median in milliseconds
- eval_kappa separate table → avoids N² explosion in eval_runs
- git_sha → reproduce any historical run
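The per_stratum column could be filled by a small aggregation like the sketch below. It is not the project's stratify.py. The key format `{bucket}_{lang}` is inferred from the `positive_explicit_VN` example in the schema comment, and the function name, the sample rows, and the `EN` language code are hypothetical.

```python
from collections import defaultdict


def per_stratum_accuracy(results):
    """Aggregate pass/fail results into {"<bucket>_<lang>": accuracy}.

    `results` is an iterable of (stratum, passed) pairs, where `stratum`
    is a dict shaped like EvalCase.stratum, e.g. {"bucket": ..., "lang": ...}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [n_pass, n_cases]
    for stratum, passed in results:
        key = f"{stratum['bucket']}_{stratum['lang']}"  # assumed key format
        totals[key][0] += int(passed)
        totals[key][1] += 1
    # Round to 4 decimals to match the NUMERIC(5,4) accuracy columns.
    return {key: round(n_pass / n, 4) for key, (n_pass, n) in totals.items()}


# Hypothetical sample rows, for illustration only.
example = [
    ({"bucket": "positive_explicit", "lang": "VN"}, True),
    ({"bucket": "positive_explicit", "lang": "VN"}, False),
    ({"bucket": "positive_subtle", "lang": "EN"}, True),
]
per_stratum = per_stratum_accuracy(example)
```

The resulting dict can be serialized with `json.dumps` and written to the JSONB column as-is, which is what keeps the table schema stable when a task adds or renames strata.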
Task config schema (Pydantic)
from typing import Any, Literal

from pydantic import BaseModel


class ScoringConfig(BaseModel):
    mode: Literal["llm_judge", "strict_substring"]
    judge_model: str | None = None  # required if mode == llm_judge
    rubric: str | None = None       # required if mode == llm_judge
    pass_rule: Literal[
        "exact_match",
        "any_substring",
        "positive_at_least_one_valid_or_empty_on_negative",
    ] = "exact_match"


class TaskConfig(BaseModel):
    name: str
    description: str
    system_prompt: str
    user_template: str  # str.format-style
    scoring: ScoringConfig
    default_judges: list[str] = []
    max_tokens: int = 2048
    temperature: float = 0.0


class EvalCase(BaseModel):
    id: str
    inputs: dict[str, Any]  # interpolated into user_template
    expected: dict[str, Any] | None = None
    stratum: dict[str, str]  # {"bucket": "positive_subtle", "lang": "VN"}
    expected_type: Literal["positive", "negative"] = "positive"
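To make two of these fields concrete, the sketch below shows how `inputs` might be interpolated into `user_template`, and one plausible reading of the `positive_at_least_one_valid_or_empty_on_negative` pass rule. The rule's semantics are inferred from its name only. The function names, the sample template, and the assumption that each finding receives one VALID/INVALID verdict are hypothetical.

```python
from typing import Any


def render_user_message(user_template: str, inputs: dict[str, Any]) -> str:
    """Interpolate EvalCase.inputs into a str.format-style template."""
    return user_template.format(**inputs)


def passes(expected_type: str, verdicts: list[str]) -> bool:
    """Assumed reading of 'positive_at_least_one_valid_or_empty_on_negative'.

    `verdicts` holds one judge verdict ("VALID" or "INVALID") per finding
    the auditor emitted for the case.
    """
    if expected_type == "positive":
        # A positive case passes when at least one finding is judged VALID.
        return any(v == "VALID" for v in verdicts)
    # A negative case passes only when the auditor emitted no findings.
    return len(verdicts) == 0


# Hypothetical template and inputs, for illustration only.
message = render_user_message(
    "Sources:\n{sources}\n\nQuestion: {question}",
    {"sources": "doc A; doc B", "question": "Do A and B contradict?"},
)
```

Under this reading, a positive case with verdicts `["INVALID", "VALID"]` passes, while a negative case with any finding at all fails, even if the judge marks that finding INVALID.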
MCP / CLI surface
| Command | Args | Purpose |
|---|---|---|
| eval-framework bake-off | --task --judges --eval-set --scorer | Multi-judge bake-off on dev set; renders table; writes audit row per judge |
| eval-framework bake-off --final-decision | + --eval-set holdout-* | One-pass holdout run; appends to holdout-runs.log |
| eval-framework score | --run-id --rescore | Re-score existing raw outputs with a different scorer (cache hit) |
| eval-framework verify | --task --judges --eval-set | Drift check; quiet stdout, exits non-zero on regression |
| eval-framework inspect | --run-id --show-failures | Per-case input + expected + actual + judge verdict |
| eval-framework kappa | --task --run-ids | Pairwise κ matrix across selected runs |
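The holdout rule behind `bake-off --final-decision` can be pictured as a small pre-flight check. This is a sketch under assumptions, not the project's runner.py: the exception class, function name, and message text are hypothetical, and it covers only the bake-off path (the weekly `verify` run on the drift subset presumably follows its own rule).

```python
class HoldoutGuardError(RuntimeError):
    """Raised when a holdout set is requested without --final-decision."""


def check_holdout_guard(eval_set_name: str, final_decision: bool) -> None:
    """Refuse bake-off runs on holdout-* sets unless explicitly confirmed."""
    if eval_set_name.startswith("holdout-") and not final_decision:
        raise HoldoutGuardError(
            f"{eval_set_name!r} is frozen; re-run with --final-decision "
            "to record a one-pass holdout evaluation."
        )


check_holdout_guard("dev-93", final_decision=False)      # dev sets always pass
check_holdout_guard("holdout-99", final_decision=True)   # explicit opt-in passes
```

Running the check before any provider call means a mistyped eval-set name costs nothing and leaves no entry in holdout-runs.log.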
Adapter implementation (illustrative)
# adapters/xai_adapter.py
import time

from ..prices import PRICES  # static $/Mtok table keyed by (provider, model)
from .base import CompletionResult


class XaiAdapter:
    provider = "xai"

    def __init__(self, model_id: str):
        from xai_sdk import AsyncClient  # lazy import: only needed when this adapter is used

        self.model_id = model_id
        self._client = AsyncClient()
        self._price_in, self._price_out = PRICES[("xai", model_id)]

    async def complete(self, system_prompt, user_message, max_tokens=2048, temperature=0.0):
        # Illustrative call shape; check the installed SDK for exact method names.
        t0 = time.perf_counter()
        resp = await self._client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=max_tokens,
            temperature=temperature,
        )
        latency_ms = int((time.perf_counter() - t0) * 1000)
        text = resp.choices[0].message.content or ""
        in_tok = resp.usage.prompt_tokens
        out_tok = resp.usage.completion_tokens
        # Prices are USD per million tokens, so scale token counts by 1e6.
        cost = in_tok / 1e6 * self._price_in + out_tok / 1e6 * self._price_out
        return CompletionResult(text, in_tok, out_tok, latency_ms, cost)
All 4 adapters follow this shape. Gemini adapter additionally guards resp.candidates (empty array = real failure mode observed during F3).
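The runner's semaphore-bounded fan-out (concurrency=8) with retry can be sketched as follows. This is not the project's runner.py: `fan_out`, `complete_with_retry`, the catch-all `except Exception`, and the delay values are assumptions, and the real adapters are described as leaning on provider-native retries. Each `call` is assumed to be a zero-argument coroutine function wrapping one `adapter.complete(...)` invocation.

```python
import asyncio


async def complete_with_retry(call, attempts=3, base_delay=0.5):
    """Await call(); retry with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def fan_out(calls, concurrency=8, base_delay=0.5):
    """Run all calls with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(call):
        async with sem:
            return await complete_with_retry(call, base_delay=base_delay)

    # return_exceptions=True keeps one failing provider from cancelling
    # the rest; failures come back as exception objects in the result list.
    return await asyncio.gather(*(bounded(c) for c in calls), return_exceptions=True)
```

Returning exceptions in place rather than raising lets a caller mark the affected judge as a partial run while the other judges still complete. A call site might look like `results = asyncio.run(fan_out(calls))`.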
Scoring rubric (LLM-judge)
The Knowledge-Audit task ships with this rubric (Sonnet 4.6 verdict prompt):
You are evaluating an auditor's finding against source documents.
A finding is VALID if and only if:
- It describes a real contradiction between two sources
- The quoted text appears verbatim in the cited source
- It is not a paraphrase, summary, or over-inference
A finding is INVALID if:
- The quoted text is not verbatim
- The two sources actually agree (paraphrase, not contradiction)
- The finding is speculative or adds claims not in sources
Sources:
{sources}
Finding:
{finding_json}
Output exactly one token: VALID or INVALID
Verdicts cached by (sha256(finding_json), case_id) → re-runs across prompt iteration loops are free.
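A minimal in-memory sketch of a cache keyed this way is shown below. The class name, the hit/miss counters, and the canonical-JSON step (`sort_keys=True`) are assumptions, and a real implementation would presumably persist entries between runs.

```python
import hashlib
import json


class VerdictCache:
    """In-memory verdict cache keyed by (sha256(finding_json), case_id)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(finding: dict, case_id: str) -> tuple[str, str]:
        # Canonical JSON so that key order does not change the hash (assumption).
        canonical = json.dumps(finding, sort_keys=True, ensure_ascii=False)
        return (hashlib.sha256(canonical.encode("utf-8")).hexdigest(), case_id)

    def get_or_judge(self, finding: dict, case_id: str, judge_fn) -> str:
        """Return the cached verdict, calling judge_fn(finding) only on a miss."""
        k = self.key(finding, case_id)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        verdict = judge_fn(finding)
        self._store[k] = verdict
        return verdict
```

Because the key as stated covers only the finding and the case, a change to the rubric or to the scoring judge would presumably need a cache flush or an extended key to avoid serving stale verdicts.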
Statistics
Bootstrap CI
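import numpy as np


def bootstrap_ci(pass_vec: list[bool], n_resamples: int = 1000, alpha: float = 0.05):
    """Percentile-bootstrap CI for accuracy; returns [low, high] as a NumPy array."""
    arr = np.array(pass_vec, dtype=int)
    n = len(arr)
    # Resample n cases with replacement and record each resample's accuracy.
    accs = [arr[np.random.randint(0, n, size=n)].mean() for _ in range(n_resamples)]
    return np.percentile(accs, [alpha / 2 * 100, (1 - alpha / 2) * 100])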
Pairwise Cohen’s κ
import math

from sklearn.metrics import cohen_kappa_score


def kappa_matrix(pass_vectors: dict[str, list[bool]]) -> dict[tuple[str, str], float]:
    """Cohen's κ for every unordered pair of judges, keyed by (judge_a, judge_b)."""
    judges = list(pass_vectors)
    out = {}
    for i, a in enumerate(judges):
        for b in judges[i + 1:]:
            k = cohen_kappa_score(pass_vectors[a], pass_vectors[b])
            out[(a, b)] = 1.0 if math.isnan(k) else k  # NaN floor: perfect agreement
    return out
Observed κ values from the F3 bake-off:
| Pair | κ | Note |
|---|---|---|
| Sonnet ↔ Opus | 0.66 | Same Anthropic family → correlated bias |
| Grok ↔ Sonnet | 0.52 | Independent providers |
| Grok ↔ OpenAI | 0.55 | Independent |
| Haiku ↔ Grok | −0.40 | Anti-correlated — opposite errors |
→ Ensembles avoid same-family pairs. This insight came from κ; aggregate accuracy would not have revealed it.
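For intuition on what these numbers measure, Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each judge's pass rate. The dependency-free sketch below covers the binary pass/fail case only. It is an illustration, not a replacement for scikit-learn's `cohen_kappa_score`, and the function name is hypothetical.

```python
import math


def cohen_kappa_binary(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two equal-length pass/fail vectors."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    pa = sum(a) / n                                  # judge A pass rate
    pb = sum(b) / n                                  # judge B pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)              # chance agreement
    if p_e == 1.0:
        # Both judges gave the same single label on every case: 0/0.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


k_partial = cohen_kappa_binary([True, True, True, False], [True, True, False, False])
k_degenerate = cohen_kappa_binary([True, True, True], [True, True, True])
```

Here `k_partial` is 0.5 (p_o = 0.75, p_e = 0.5), and `k_degenerate` is NaN because chance agreement is already 1. That second case is the one the NaN floor maps to 1.0.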
Bake-off sequence (Knowledge-Audit, F3 actual run)
sequenceDiagram
autonumber
participant Op as Operator
participant CLI as eval-framework CLI
participant R as Runner
participant A1 as Anthropic Adapter
participant A2 as xAI Adapter
participant A3 as OpenAI Adapter
participant A4 as Gemini Adapter
participant S as Scorer (LLM-judge)
participant J as Sonnet 4.6 (verdict)
participant DB as Postgres
Op->>CLI: bake-off --task knowledge_audit<br/>--judges grok-4.3,haiku-4.5,sonnet-4.6,opus-4.7,gpt-5.4-mini,gemini-flash<br/>--eval-set dev-93
CLI->>R: load task + eval set (93 cases)
R->>R: holdout guard: 'dev-93' OK
par parallel fan-out (sem=8)
R->>A1: complete × (3 models × 93 cases)
R->>A2: complete × (2 models × 93 cases)
R->>A3: complete × (1 model × 93 cases)
R->>A4: complete × (1 model × 93 cases)
end
A1-->>R: raw outputs + tokens + cost
A2-->>R: raw outputs + tokens + cost
A3-->>R: raw outputs + tokens + cost
A4-->>R: empty candidates × 93 ⚠
R->>S: score(all raw outputs)
loop per finding
S->>J: VALID or INVALID verdict
J-->>S: VALID/INVALID
Note over S,J: verdict cached by (finding_hash, case_id)
end
S-->>R: pass_vector per judge
R->>R: bootstrap CI (1000 resamples)
R->>R: pairwise Cohen's κ
R->>R: per-stratum aggregation
R->>DB: INSERT eval_runs × 7 + eval_kappa × C(7,2)
R-->>Op: rich table + κ matrix + cost summary
Op->>CLI: bake-off --judges grok-4.3<br/>--eval-set holdout-99 --final-decision
Note over CLI,Op: Guard PASSES (flag present)
CLI->>R: same flow, single judge
R->>DB: INSERT eval_runs (run_type='final-decision')
R->>R: append (sha, judge, run_id) to holdout-runs.log
R-->>Op: 99.0% LLM-judge / 79.8% strict / $0.61/mo
Performance numbers
Measured on MacBook Pro M2 Max during F3 (2026-05-21):
| Operation | Number | Notes |
|---|---|---|
| Bake-off wall time (6 judges × N=93 dev) | ~12 min | Parallel fan-out, semaphore=8 |
| Bake-off wall time (6 judges × N=30 dev) | ~4 min | Weekly drift baseline |
| Bake-off cost (6 judges × N=30 dev) | ~$3 | Anthropic + xAI + OpenAI; Gemini free tier |
| Holdout-99 single-judge cost | ~$0.50 | + ~$0.15 LLM-judge scoring |
| LLM-judge scoring overhead | ~30% on top of generation | Sonnet 4.6 verdict per finding |
| Verdict cache hit rate (re-scoring iteration) | >90% | After 2nd run on same eval set |
| Production cost — Grok 4.3 audit volume | $0.61/month | At personal Knowledge-Audit volume |
| Time to first verified production decision | 1 day | F1+F2+F3 + holdout confirmation |
| 192-case stratified set bootstrap | ~25 min wall | Haiku one-shot generation + hand-balance |
Retrieval quality measured via F4 application (Personal-RAG, 93-query held-out, 2026-05-21):
| Stage | Hit@1 | Hit@3 | MRR |
|---|---|---|---|
| bge-m3 only | 86.0% | 97.8% | 0.918 |
| bge-m3 + bge-reranker-v2-m3 | 89.2% | 97.8% | 0.948 |
The reranker swap was shipped as default because Eval-Framework measured a +3.2pp Hit@1 lift at negligible latency cost.
Knowledge-Audit production decision (F3, 2026-05-21):
| Judge | Accuracy (dev-93) | Holdout-99 | Cost/case | p95 ms | Verdict |
|---|---|---|---|---|---|
| Grok 4.3 | 83.3% | 99.0% (LLM-judge) / 79.8% (strict) | $0.0021 | 1840 | PRODUCTION |
| Opus 4.7 | 66.7% | — | $0.0089 | 2110 | too expensive |
| Sonnet 4.6 | 60.0% | — | $0.0034 | 1320 | beat by Grok |
| GPT-5.4-mini | 53.3% | — | $0.0011 | 980 | cheapest but accuracy gap |
| Grok 4-fast-reasoning | 50.0% | — | $0.0008 | 720 | overlapping CI with GPT |
| Haiku 4.5 (as direct auditor) | 13.3% | abandoned | $0.0004 | 460 | conservative-bias collapse |
| Gemini 2.5 Flash | 0.0% (99/99 empty) | — | $0 | 510 | API integration failure |
Reliability features
| Feature | How |
|---|---|
| Frozen-holdout guard | CLI refuses runs on holdout-* unless --final-decision is passed |
| Tamper log | Each --final-decision run appends to holdout-runs.log |
| Idempotent scoring | Verdict cache by (finding_hash, case_id) |
| Adapter retry | Provider-native retry × 3 with exp backoff |
| Provider sanity-check | eval-framework verify --smoke runs 1 case per judge before bake-off |
| Audit log | Every run writes eval_runs row + git SHA |
| Drift monitor | Weekly launchd cron + Telegram P0 on ≥5pp regression |
| Cohen’s κ NaN floor | Perfect-agreement degenerate case clamped at 1.0 with note |
| Bootstrap CI cap | Resamples capped at 1000 to bound wall time |
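The drift rule (alert on a regression of 5pp or more against the last-4-week median) reduces to a comparison like the sketch below. The function names, the rounding step, and the sample accuracies are assumptions. The real drift.py presumably reads its baseline from eval_runs.

```python
from statistics import median


def drift_regression_pp(baseline_accuracies: list[float], current_accuracy: float) -> float:
    """Drop, in percentage points, of the current run versus the baseline median."""
    drop = (median(baseline_accuracies) - current_accuracy) * 100
    # Round so float noise does not push an exact 5.0pp drop below the threshold.
    return round(drop, 4)


def is_regression(baseline_accuracies, current_accuracy, threshold_pp: float = 5.0) -> bool:
    """True when the drop meets or exceeds the alert threshold."""
    return drift_regression_pp(baseline_accuracies, current_accuracy) >= threshold_pp


# Hypothetical accuracies from four weekly drift checks.
baseline = [0.9667, 0.9333, 0.9667, 1.0]
alert = is_regression(baseline, 0.90)      # drop of about 6.67pp
quiet = is_regression(baseline, 0.9333)    # drop of about 3.34pp
```

Using the median rather than the latest run as the baseline keeps a single noisy week from shifting the reference point.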
Security model
| Threat | Mitigation |
|---|---|
| API key leak | Keys in ~/.config/eval-framework/.env, gitignored; never logged |
| Production data in eval cases | Cases auto-scrubbed by regex + PII LLM classifier at ingest; PR review gate |
| Holdout overfit | CLI guard + tamper log + loud overfit warning on repeat runs |
| Judge model bias | Pairwise κ surfaces correlated bias; report flags κ > 0.6 same-provider pairs |
| Cost runaway | Per-run cost cap (--max-cost-usd, default $5); CLI aborts if projected cost exceeds |
| Drift false positive | Alert text includes run-id; operator confirms with full holdout-99 before acting |
| Provider outage during bake-off | Other judges still complete; report flags partial run |
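The cost-runaway mitigation can be pictured as a projection followed by a check against `--max-cost-usd`. How the CLI actually estimates cost is not described here, so the token-average approach, the function names, and the prices in the example are assumptions and placeholders, not real provider rates.

```python
class CostCapExceeded(RuntimeError):
    """Raised when projected run cost exceeds the configured cap."""


def projected_cost_usd(n_cases, avg_in_tokens, avg_out_tokens, price_in, price_out):
    """Estimate run cost; prices are USD per million tokens."""
    per_case = avg_in_tokens / 1e6 * price_in + avg_out_tokens / 1e6 * price_out
    return n_cases * per_case


def enforce_cost_cap(projected: float, max_cost_usd: float = 5.0) -> None:
    """Abort before any provider call if the projection exceeds the cap."""
    if projected > max_cost_usd:
        raise CostCapExceeded(
            f"projected ${projected:.2f} exceeds cap ${max_cost_usd:.2f}"
        )


# Placeholder prices ($3 in / $15 out per Mtok) and token averages.
estimate = projected_cost_usd(93, 2000, 500, price_in=3.0, price_out=15.0)
enforce_cost_cap(estimate)  # about $1.26, under the $5 default
```

Checking the projection before the fan-out starts means an oversized eval set or an expensive judge fails fast instead of partway through a run.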
Reproducibility — quickstart for a forker
# 1. Clone + bootstrap
git clone <your-fork>/eval-framework
cd eval-framework
python3.11 -m venv venv
./venv/bin/pip install -e . # installs `eval-framework` CLI
# 2. Provider keys
mkdir -p ~/.config/eval-framework
cat > ~/.config/eval-framework/.env <<EOF
ANTHROPIC_API_KEY=sk-ant-...
XAI_API_KEY=xai-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF
# 3. Postgres audit log (shared with Personal-RAG OK)
psql -d ragkb -f schema/eval_runs.sql
# 4. Bring your own task
cp tasks/knowledge_audit.yaml tasks/my_task.yaml
$EDITOR tasks/my_task.yaml # set system_prompt + rubric + judge defaults
# 5. Bring your own eval set (start with 30 cases hand-curated)
mkdir -p eval/my_task
cp eval/knowledge_audit/dev-93.yaml eval/my_task/dev-30.yaml
$EDITOR eval/my_task/dev-30.yaml
# 6. Smoke test (1 case per judge)
eval-framework verify --task my_task --judges grok-4.3,claude-haiku-4.5 --smoke
# 7. Bake-off
eval-framework bake-off --task my_task \
--judges grok-4.3,claude-haiku-4.5,claude-sonnet-4.6,gpt-5.4-mini \
--eval-set dev-30 --scorer llm_judge
# 8. After iteration, freeze a holdout
mv eval/my_task/holdout-candidates.yaml eval/my_task/holdout-99.yaml
# 9. One-pass final decision
eval-framework bake-off --task my_task --judges <winner> \
--eval-set holdout-99 --final-decision
# 10. Install drift cron
cp launchd/ai.eval-framework.drift-check.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.eval-framework.drift-check.plist
Total: 1-2 hours to first verified production decision once Postgres is ready.
Future work
- Local MLX adapter wired — retry distill on weekend with clean GPU state; if 70%+ → tier-1 daily judge (cost-free)
- 4-tier triage routing in production: daily=4B+Haiku verifier, weekly=8B+Haiku, monthly=32B alone, on-demand=Grok 4.3
- Promptfoo CI integration — deferred unless GitHub Actions wrap becomes worthwhile
- Multi-task bake-off — single command runs the same judge list across all 8 portfolio tasks → portfolio-wide cost-quality frontier
- Self-consistency / adaptive escalation — designed (Decision #4 in notes) but bad ROI at current volume; revisit if production volume ≥10× current
- Hand-curated eval cases — replace Haiku-bootstrapped cases gradually with hand-curated ones for higher signal density
License & attribution
Personal project. Built on:
- Anthropic SDK
- xAI SDK
- OpenAI SDK
- Google Gen AI SDK
- scikit-learn (cohen_kappa_score)
- numpy (bootstrap)
- click, pydantic, rich