Implementation

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

A production-grade personal eval framework in continuous use across the side-project portfolio:

7-model bake-off shipped (Claude Haiku 4.5, Sonnet 4.6, Opus 4.7, Grok 4-fast-reasoning, Grok 4.3, GPT-5.4-mini, Gemini 2.5 Flash)
Stratified 192-case eval set (5 buckets × 3 languages); 93 dev + 99 holdout (FROZEN)
LLM-as-judge scorer (Grok 4.3 default) — replaces strict substring on multi-valid-output tasks; verified +19.2pp accuracy gap on holdout-99
Production judge picked: Grok 4.3 wins Knowledge-Audit at 99.0% on holdout-99, $0.61/mo at audit volume
Statistics: bootstrap CI 1000 resamples (95%) + pairwise Cohen’s κ for correlated-bias detection
CLI: eval-framework bake-off | score | verify + Python API for notebooks
F1+F2+F3 shipped in a 10-hour single-day sprint (2026-05-21)
Applied to 2 projects so far: Knowledge-Audit (production) + Personal-RAG (Hit@3=97.8%, MRR=0.948 verified 2026-05-21); 6 more queued
Drift monitor: weekly launchd cron + Telegram P0 alert on ≥5pp regression
Audit log: Postgres eval_runs table shared with Personal-RAG infra

Stack

Layer	Component	Version / Notes
Runtime	Python	3.11 + venv
CLI	`click`	sub-commands: `bake-off`, `score`, `verify`, `inspect`, `kappa`
Config schema	Pydantic	v2; YAML loader via `pyyaml`
Async runtime	`asyncio` + `anyio`	semaphore-bounded concurrency=8
Provider SDKs	`anthropic` · `xai-sdk` · `openai` · `google-genai`	official clients per provider
Statistics	`numpy` + `scikit-learn`	bootstrap resample; `cohen_kappa_score`
Audit log	Postgres 16	shared with Personal-RAG; table `eval_runs`
Scheduler	`launchd`	`ai.eval-framework.drift-check.plist` (weekly Sat 03:00)
Alerting	Telegram Bot API	`eval-framework-bot`
Test runner	`pytest`	parametrized over eval cases for ad-hoc inspection

Directory layout

Repo (`~/Documents/Side.Projects/eval-framework/`)

src/eval_framework/
├── cli.py                       # click entrypoints
├── runner.py                    # asyncio fan-out + retry + holdout guard
├── config.py                    # Pydantic schemas (TaskConfig, EvalCase, ...)
├── adapters/
│   ├── base.py                  # JudgeAdapter protocol + CompletionResult
│   ├── anthropic_adapter.py     # ~35 LOC
│   ├── xai_adapter.py           # ~30 LOC
│   ├── openai_adapter.py        # ~30 LOC
│   ├── gemini_adapter.py        # ~40 LOC (handles empty candidates)
│   └── mlx_adapter.py           # config-shaped stub, deferred wire-up
├── scoring/
│   ├── strict.py                # substring match
│   ├── llm_judge.py             # judge call + verdict cache
│   └── rubrics.py               # built-in rubric templates
├── stats/
│   ├── bootstrap.py             # 1000-resample CI
│   ├── kappa.py                 # pairwise Cohen's κ matrix
│   └── stratify.py              # per-(bucket, lang) aggregation
├── reporter.py                  # rich-table render + Postgres write
├── prices.py                    # static $/Mtok table per (provider, model)
└── drift.py                     # weekly verify cron entry

tasks/
├── knowledge_audit.yaml         # production task
├── personal_rag_retrieval.yaml  # F4 application
├── mail_classify.yaml           # queued
└── voice_tool_use.yaml          # queued

eval/
├── knowledge_audit/
│   ├── dev-93.yaml              # iterable
│   ├── holdout-99.yaml          # FROZEN
│   ├── holdout-30-sample.yaml   # weekly drift subset
│   └── holdout-runs.log         # tamper-evident
└── personal_rag/
    └── personal-93.yaml         # 93-query held-out personal eval

schema/
└── eval_runs.sql

launchd/
└── ai.eval-framework.drift-check.plist

Schema

CREATE TABLE eval_runs (
    id              BIGSERIAL PRIMARY KEY,
    task_name       TEXT NOT NULL,
    eval_set_name   TEXT NOT NULL,             -- 'dev-93' | 'holdout-99' | ...
    judge_model     TEXT NOT NULL,             -- 'grok-4.3' | 'claude-haiku-4.5' | ...
    scorer_mode     TEXT NOT NULL,             -- 'llm_judge' | 'strict_substring'
    judge_for_scoring TEXT,                    -- 'claude-sonnet-4-6' | 'grok-4.3' | NULL
    n_cases         INT NOT NULL,
    n_pass          INT NOT NULL,
    accuracy        NUMERIC(5,4) NOT NULL,     -- 0.0000 - 1.0000
    ci_low          NUMERIC(5,4),
    ci_high         NUMERIC(5,4),
    p95_latency_ms  INT,
    total_cost_usd  NUMERIC(8,4),
    per_stratum     JSONB,                     -- {"positive_explicit_VN": 0.88, ...}
    run_type        TEXT NOT NULL,             -- 'bake-off' | 'final-decision' | 'drift_check'
    git_sha         TEXT,
    started_at      TIMESTAMPTZ NOT NULL,
    finished_at     TIMESTAMPTZ NOT NULL,
    notes           TEXT
);
CREATE INDEX idx_eval_runs_task ON eval_runs(task_name, finished_at DESC);
CREATE INDEX idx_eval_runs_drift ON eval_runs(task_name, judge_model, run_type, finished_at DESC);

CREATE TABLE eval_kappa (
    eval_run_id_a   BIGINT REFERENCES eval_runs(id),
    eval_run_id_b   BIGINT REFERENCES eval_runs(id),
    kappa           NUMERIC(5,4) NOT NULL,
    PRIMARY KEY (eval_run_id_a, eval_run_id_b)
);

Why this shape:

One row per (judge × eval-set × run) = grain matches the reporter table
per_stratum JSONB → flexible without per-task schema churn
(task_name, judge_model, run_type) index → drift cron queries last-4-week median in milliseconds
eval_kappa separate table → avoids N² explosion in eval_runs
git_sha → reproduce any historical run
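
The drift check described above (last-4-week median versus the current run, alert on a drop of 5pp or more) could look roughly like the sketch below. The SQL string and the function name `is_regression` are assumptions for illustration; only the column names come from the schema, and the real `drift.py` may be organised differently.

```python
from statistics import median

# Assumed query shape: it filters on the same columns as idx_eval_runs_drift.
LAST_RUNS_SQL = """
SELECT accuracy
FROM eval_runs
WHERE task_name = %s
  AND judge_model = %s
  AND run_type = 'drift_check'
  AND finished_at >= now() - interval '28 days'
ORDER BY finished_at DESC
"""


def is_regression(history: list[float], current: float, threshold_pp: float = 5.0) -> bool:
    """True when `current` sits at least `threshold_pp` points below the median of `history`.

    Accuracies are fractions in [0, 1], matching eval_runs.accuracy.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    drop_pp = (median(history) - current) * 100
    return drop_pp >= threshold_pp


# Median of the four baseline runs is 0.985, so 0.93 is a 5.5pp drop.
alert = is_regression([0.99, 0.98, 0.99, 0.97], 0.93)
```

Using the median rather than the latest run keeps a single noisy week from moving the baseline.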

Task config schema (Pydantic)

from typing import Any, Literal

from pydantic import BaseModel

class ScoringConfig(BaseModel):
    mode: Literal["llm_judge", "strict_substring"]
    judge_model: str | None = None                 # required if mode == llm_judge
    rubric: str | None = None                      # required if mode == llm_judge
    pass_rule: Literal[
        "exact_match",
        "any_substring",
        "positive_at_least_one_valid_or_empty_on_negative",
    ] = "exact_match"

class TaskConfig(BaseModel):
    name: str
    description: str
    system_prompt: str
    user_template: str                              # str.format-style
    scoring: ScoringConfig
    default_judges: list[str] = []
    max_tokens: int = 2048
    temperature: float = 0.0

class EvalCase(BaseModel):
    id: str
    inputs: dict[str, Any]                          # interpolated into user_template
    expected: dict[str, Any] | None = None
    stratum: dict[str, str]                         # {"bucket": "positive_subtle", "lang": "VN"}
    expected_type: Literal["positive", "negative"] = "positive"
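
One plausible reading of the `positive_at_least_one_valid_or_empty_on_negative` pass rule, combined with `EvalCase.expected_type`, is sketched below. The function name `case_passes` and the exact semantics are assumptions inferred from the rule's name, not the shipped scorer.

```python
from typing import Literal


def case_passes(expected_type: Literal["positive", "negative"], verdicts: list[str]) -> bool:
    """Decide pass/fail for one eval case.

    `verdicts` holds one 'VALID' or 'INVALID' string per finding the model emitted.
    Assumed semantics: a negative case passes only when nothing was reported;
    a positive case passes when at least one finding was judged VALID.
    """
    if expected_type == "negative":
        return len(verdicts) == 0
    return any(v == "VALID" for v in verdicts)


positive_ok = case_passes("positive", ["INVALID", "VALID"])
negative_ok = case_passes("negative", [])
```

Under this reading a positive case with several valid findings still counts as a single pass, which is why strict substring matching undercounts on multi-valid-output tasks.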

CLI surface

Command	Args	Purpose
`eval-framework bake-off`	`--task` `--judges` `--eval-set` `--scorer`	Multi-judge bake-off on dev set; renders table; writes audit row per judge
`eval-framework bake-off --final-decision`	+ `--eval-set holdout-*`	One-pass holdout run; appends to `holdout-runs.log`
`eval-framework score`	`--run-id` `--rescore`	Re-score existing raw outputs with a different scorer (cache hit)
`eval-framework verify`	`--task` `--judges` `--eval-set`	Drift check; quiet stdout, exits non-zero on regression
`eval-framework inspect`	`--run-id` `--show-failures`	Per-case input + expected + actual + judge verdict
`eval-framework kappa`	`--task` `--run-ids`	Pairwise κ matrix across selected runs

Adapter implementation (illustrative)

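# adapters/xai_adapter.py
import time

from ..prices import PRICES            # static $/Mtok table keyed by (provider, model)
from .base import CompletionResult


class XaiAdapter:
    provider = "xai"

    def __init__(self, model_id: str):
        from xai_sdk import AsyncClient    # lazy import: only needed when this provider is used
        self.model_id = model_id
        self._client = AsyncClient()
        self._price_in, self._price_out = PRICES[("xai", model_id)]

    async def complete(self, system_prompt, user_message, max_tokens=2048, temperature=0.0):
        t0 = time.perf_counter()
        # Illustrative OpenAI-style call shape; confirm method names against the installed xai-sdk.
        resp = await self._client.chat.completions.create(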
            model=self.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=max_tokens,
            temperature=temperature,
        )
        latency_ms = int((time.perf_counter() - t0) * 1000)
        text = resp.choices[0].message.content or ""
        in_tok = resp.usage.prompt_tokens
        out_tok = resp.usage.completion_tokens
        cost = in_tok / 1e6 * self._price_in + out_tok / 1e6 * self._price_out
        return CompletionResult(text, in_tok, out_tok, latency_ms, cost)

All four wired adapters follow this shape (mlx_adapter.py is still a stub). The Gemini adapter additionally guards against an empty resp.candidates array, a real failure mode observed during F3.
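
To make the shared adapter shape and the semaphore-bounded fan-out concrete, here is a minimal sketch. The `CompletionResult` field names are inferred from the positional call in the adapter above. `EchoAdapter` and `fan_out` are hypothetical names invented for this example, and no provider SDK is called.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CompletionResult:
    text: str
    in_tok: int
    out_tok: int
    latency_ms: int
    cost: float


class JudgeAdapter(Protocol):
    provider: str

    async def complete(self, system_prompt: str, user_message: str,
                       max_tokens: int = 2048, temperature: float = 0.0) -> CompletionResult: ...


class EchoAdapter:
    """Hypothetical stand-in adapter; it makes no network calls."""
    provider = "echo"

    async def complete(self, system_prompt, user_message, max_tokens=2048, temperature=0.0):
        await asyncio.sleep(0)  # yield control the way a real network call would
        return CompletionResult(user_message.upper(), 1, 1, 0, 0.0)


async def fan_out(adapter: JudgeAdapter, system_prompt: str, messages: list[str],
                  concurrency: int = 8) -> list[CompletionResult]:
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` calls in flight

    async def one(msg: str) -> CompletionResult:
        async with sem:
            return await adapter.complete(system_prompt, msg)

    # gather() returns results in the same order as the input messages.
    return await asyncio.gather(*(one(m) for m in messages))


results = asyncio.run(fan_out(EchoAdapter(), "system prompt", ["a", "b", "c"]))
```

Because the runner only depends on the protocol, a stub like this can also stand in for a provider in tests.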

Scoring rubric (LLM-judge)

The Knowledge-Audit task ships with this rubric (Sonnet 4.6 verdict prompt):

You are evaluating an auditor's finding against source documents.

A finding is VALID if and only if:
  - It describes a real contradiction between two sources
  - The quoted text appears verbatim in the cited source
  - It is not a paraphrase, summary, or over-inference

A finding is INVALID if:
  - The quoted text is not verbatim
  - The two sources actually agree (paraphrase, not contradiction)
  - The finding is speculative or adds claims not in sources

Sources:
{sources}

Finding:
{finding_json}

Output exactly one token: VALID or INVALID

Verdicts cached by (sha256(finding_json), case_id) → re-runs across prompt iteration loops are free.
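
A minimal in-memory sketch of a verdict cache keyed by (sha256(finding_json), case_id) follows. The class name `VerdictCache`, the canonical-JSON step, and the hit/miss counters are assumptions for illustration; the real `llm_judge.py` may persist verdicts differently.

```python
import hashlib
import json
from typing import Callable


def cache_key(finding: dict, case_id: str) -> tuple[str, str]:
    # Canonical JSON (sorted keys) so that key order does not change the hash.
    payload = json.dumps(finding, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest(), case_id


class VerdictCache:
    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_judge(self, finding: dict, case_id: str, judge: Callable[[dict], str]) -> str:
        key = cache_key(finding, case_id)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        verdict = judge(finding)  # only reached on a cache miss
        self._store[key] = verdict
        return verdict


calls = []

def fake_judge(finding: dict) -> str:
    calls.append(finding)
    return "VALID"

cache = VerdictCache()
first = cache.get_or_judge({"quote": "x", "source": "a.md"}, "case-001", fake_judge)
second = cache.get_or_judge({"source": "a.md", "quote": "x"}, "case-001", fake_judge)
```

Note that this key does not include the rubric text, so under this sketch a rubric change would presumably require clearing the cache.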

Statistics

Bootstrap CI

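import numpy as np

def bootstrap_ci(pass_vec: list[bool], n_resamples: int = 1000, alpha: float = 0.05, seed: int | None = None):
    """Percentile bootstrap CI for accuracy; returns array([low, high])."""
    arr = np.asarray(pass_vec, dtype=int)
    n = len(arr)
    rng = np.random.default_rng(seed)                    # pass a seed for reproducible intervals
    idx = rng.integers(0, n, size=(n_resamples, n))      # one row of resampled indices per replicate
    accs = arr[idx].mean(axis=1)
    return np.percentile(accs, [alpha / 2 * 100, (1 - alpha / 2) * 100])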

Pairwise Cohen’s κ

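import math

from sklearn.metrics import cohen_kappa_score

def kappa_matrix(pass_vectors: dict[str, list[bool]]) -> dict[tuple[str, str], float]:
    judges = list(pass_vectors)
    out = {}
    for i, a in enumerate(judges):
        for b in judges[i + 1:]:                          # upper triangle only: each pair once
            k = cohen_kappa_score(pass_vectors[a], pass_vectors[b])
            # κ is 0/0 (NaN) when both vectors hold one identical constant label
            out[(a, b)] = 1.0 if math.isnan(k) else k     # NaN floor: treat as perfect agreement
    return out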
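
For readers who want to see where the NaN comes from, here is a dependency-free sketch of Cohen's κ for two boolean pass vectors. It is an illustration of the formula κ = (p_o − p_e) / (1 − p_e), not a replacement for the `cohen_kappa_score` call above, and the function name `cohen_kappa` is hypothetical.

```python
import math


def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two equal-length, non-empty vectors")
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                  # each judge's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)              # agreement expected by chance
    if p_e == 1.0:
        return math.nan                              # 0/0: both judges constant and identical
    return (p_o - p_e) / (1 - p_e)


# p_o = 0.75, p_e = 0.5, so kappa = 0.5
k_partial = cohen_kappa([True, True, False, False], [True, False, False, False])
# Both judges pass every case: chance agreement is already 1, so kappa is undefined.
k_degenerate = cohen_kappa([True, True, True], [True, True, True])
```

The degenerate case is the one the NaN floor handles: when two judges pass (or fail) every case, chance agreement is already 1 and κ is undefined, so the matrix records 1.0 instead.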

Observed κ values from the F3 bake-off:

Pair	κ	Note
Sonnet ↔ Opus	0.66	Same Anthropic family → correlated bias
Grok ↔ Sonnet	0.52	Independent providers
Grok ↔ OpenAI	0.55	Independent
Haiku ↔ Grok	−0.40	Anti-correlated — opposite errors

→ Ensembles avoid same-family pairs. This insight came from κ; aggregate accuracy would not have revealed it.
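
The same-family check can be expressed in a few lines. The sketch below reuses the κ values from the table and the 0.6 threshold from the security model. The provider mapping and the function name `correlated_pairs` are assumptions for illustration.

```python
PROVIDER = {
    "Sonnet": "anthropic",
    "Opus": "anthropic",
    "Haiku": "anthropic",
    "Grok": "xai",
    "OpenAI": "openai",
}

# Observed kappa values copied from the table above.
KAPPA = {
    ("Sonnet", "Opus"): 0.66,
    ("Grok", "Sonnet"): 0.52,
    ("Grok", "OpenAI"): 0.55,
    ("Haiku", "Grok"): -0.40,
}


def correlated_pairs(kappa: dict[tuple[str, str], float],
                     provider: dict[str, str],
                     threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return same-provider pairs whose kappa exceeds the threshold."""
    return [
        pair for pair, k in kappa.items()
        if k > threshold and provider[pair[0]] == provider[pair[1]]
    ]


flagged = correlated_pairs(KAPPA, PROVIDER)
```

With the observed values only Sonnet ↔ Opus is flagged, which matches the note in the table.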

Bake-off sequence (Knowledge-Audit, F3 actual run)

sequenceDiagram
    autonumber
    participant Op as Operator
    participant CLI as eval-framework CLI
    participant R as Runner
    participant A1 as Anthropic Adapter
    participant A2 as xAI Adapter
    participant A3 as OpenAI Adapter
    participant A4 as Gemini Adapter
    participant S as Scorer (LLM-judge)
    participant J as Sonnet 4.6 (verdict)
    participant DB as Postgres

    Op->>CLI: bake-off --task knowledge_audit<br/>--judges grok-4.3,haiku-4.5,sonnet-4.6,opus-4.7,gpt-5.4-mini,gemini-flash<br/>--eval-set dev-93

    CLI->>R: load task + eval set (93 cases)
    R->>R: holdout guard: 'dev-93' OK
    par parallel fan-out (sem=8)
        R->>A1: complete × (3 models × 93 cases)
        R->>A2: complete × (2 models × 93 cases)
        R->>A3: complete × (1 model × 93 cases)
        R->>A4: complete × (1 model × 93 cases)
    end
    A1-->>R: raw outputs + tokens + cost
    A2-->>R: raw outputs + tokens + cost
    A3-->>R: raw outputs + tokens + cost
    A4-->>R: empty candidates × 93 ⚠
    R->>S: score(all raw outputs)
    loop per finding
        S->>J: VALID or INVALID verdict
        J-->>S: VALID/INVALID
        Note over S,J: verdict cached by (finding_hash, case_id)
    end
    S-->>R: pass_vector per judge
    R->>R: bootstrap CI (1000 resamples)
    R->>R: pairwise Cohen's κ
    R->>R: per-stratum aggregation
    R->>DB: INSERT eval_runs × 7 + eval_kappa × C(7,2)
    R-->>Op: rich table + κ matrix + cost summary

    Op->>CLI: bake-off --judges grok-4.3<br/>--eval-set holdout-99 --final-decision
    Note over CLI,Op: Guard PASSES (flag present)
    CLI->>R: same flow, single judge
    R->>DB: INSERT eval_runs (run_type='final-decision')
    R->>R: append (sha, judge, run_id) to holdout-runs.log
    R-->>Op: 99.0% LLM-judge / 79.8% strict / $0.61/mo

Performance numbers

Measured on MacBook Pro M2 Max during F3 (2026-05-21):

Operation	Number	Notes
Bake-off wall time (6 judges × N=93 dev)	~12 min	Parallel fan-out, semaphore=8
Bake-off wall time (6 judges × N=30 dev)	~4 min	Weekly drift baseline
Bake-off cost (6 judges × N=30 dev)	~$3	Anthropic + xAI + OpenAI; Gemini free tier
Holdout-99 single-judge cost	~$0.50	+ ~$0.15 LLM-judge scoring
LLM-judge scoring overhead	~30% on top of generation	Sonnet 4.6 verdict per finding
Verdict cache hit rate (re-scoring iteration)	>90%	After 2nd run on same eval set
Production cost — Grok 4.3 audit volume	$0.61/month	At personal Knowledge-Audit volume
Time to first verified production decision	1 day	F1+F2+F3 + holdout confirmation
192-case stratified set bootstrap	~25 min wall	Haiku one-shot generation + hand-balance

Retrieval quality measured via F4 application (Personal-RAG, 93-query held-out, 2026-05-21):

Stage	Hit@1	Hit@3	MRR
bge-m3 only	86.0%	97.8%	0.918
bge-m3 + bge-reranker-v2-m3	89.2%	97.8%	0.948

The reranker was shipped as the default because Eval-Framework measured a +3.2pp Hit@1 lift (MRR 0.918 → 0.948, Hit@3 unchanged) at negligible latency cost.
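
For reference, Hit@k and MRR can be computed from the rank of the first relevant document per query. The sketch below uses the standard definitions with made-up ranks; the function names are hypothetical and this is not the Personal-RAG harness itself.

```python
from typing import Optional


def hit_at_k(ranks: list[Optional[int]], k: int) -> float:
    """Fraction of queries whose first relevant document is ranked within the top k.

    ranks[i] is the 1-based rank of the first relevant document for query i,
    or None when no relevant document was retrieved.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: list[Optional[int]]) -> float:
    """Mean reciprocal rank; a query with no relevant result contributes 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)


example_ranks = [1, 1, 2, None, 3]      # illustrative data, not from the 93-query set
hit1 = hit_at_k(example_ranks, 1)       # 2 of 5 queries
hit3 = hit_at_k(example_ranks, 3)       # 4 of 5 queries
score = mrr(example_ranks)              # (1 + 1 + 1/2 + 0 + 1/3) / 5
```

On a 93-query set each query is worth about 1.08pp, so the reported 86.0% and 89.2% Hit@1 are consistent with 80 and 83 first-rank hits respectively.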

Knowledge-Audit production decision (F3, 2026-05-21):

Judge	Accuracy (dev-93)	Holdout-99	Cost/case	p95 ms	Verdict
Grok 4.3	83.3%	99.0% (LLM-judge) / 79.8% (strict)	$0.0021	1840	PRODUCTION
Opus 4.7	66.7%	—	$0.0089	2110	too expensive
Sonnet 4.6	60.0%	—	$0.0034	1320	beat by Grok
GPT-5.4-mini	53.3%	—	$0.0011	980	cheapest but accuracy gap
Grok 4-fast-reasoning	50.0%	—	$0.0008	720	overlapping CI with GPT
Haiku 4.5 (as direct auditor)	13.3%	abandoned	$0.0004	460	conservative-bias collapse
Gemini 2.5 Flash	0.0% (99/99 empty)	—	$0	510	API integration failure

Reliability features

Feature	How
Frozen-holdout guard	CLI refuses non-`--final-decision` runs on `holdout-*`
Tamper log	Each `--final-decision` run appends to `holdout-runs.log`
Idempotent scoring	Verdict cache by `(finding_hash, case_id)`
Adapter retry	Provider-native retry × 3 with exp backoff
Provider sanity-check	`eval-framework verify --smoke` runs 1 case per judge before bake-off
Audit log	Every run writes `eval_runs` row + git SHA
Drift monitor	Weekly launchd cron + Telegram P0 on ≥5pp regression
Cohen’s κ NaN floor	Perfect-agreement degenerate case clamped at 1.0 with note
Bootstrap CI cap	Resamples capped at 1000 to bound wall time
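
The frozen-holdout guard and the run log can be sketched as two small functions. The names `HoldoutGuardError`, `check_holdout_guard`, and `append_holdout_run`, and the tab-separated log format, are assumptions for illustration; only the (sha, judge, run_id) tuple and the `holdout-*` pattern come from this document.

```python
import fnmatch
from pathlib import Path


class HoldoutGuardError(RuntimeError):
    """Raised when a holdout set is requested without --final-decision."""


def check_holdout_guard(eval_set: str, final_decision: bool) -> None:
    if fnmatch.fnmatch(eval_set, "holdout-*") and not final_decision:
        raise HoldoutGuardError(
            f"{eval_set!r} is frozen; re-run with --final-decision to use it"
        )


def append_holdout_run(log_path: Path, git_sha: str, judge: str, run_id: int) -> int:
    """Append one record and return the total number of recorded holdout runs.

    The caller can use the count to print an overfit warning on repeat runs.
    """
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(f"{git_sha}\t{judge}\t{run_id}\n")
    return len(log_path.read_text(encoding="utf-8").splitlines())


check_holdout_guard("dev-93", final_decision=False)      # dev sets are always allowed
check_holdout_guard("holdout-99", final_decision=True)   # allowed because the flag is present
```

Calling `check_holdout_guard("holdout-99", final_decision=False)` raises, which is the behaviour the CLI relies on to keep iteration runs off the frozen set.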

Security model

Threat	Mitigation
API key leak	Keys in `~/.config/eval-framework/.env`, gitignored; never logged
Production data in eval cases	Cases auto-scrubbed by regex + PII LLM classifier at ingest; PR review gate
Holdout overfit	CLI guard + tamper log + loud overfit warning on repeat runs
Judge model bias	Pairwise κ surfaces correlated bias; report flags κ > 0.6 same-provider pairs
Cost runaway	Per-run cost cap (`--max-cost-usd`, default $5); CLI aborts if projected cost exceeds
Drift false positive	Alert text includes run-id; operator confirms with full holdout-99 before acting
Provider outage during bake-off	Other judges still complete; report flags partial run
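
The cost cap can be checked before any provider call is made. The sketch below assumes the projection is simply cases × per-case cost summed over judges, using per-case figures from the production decision table. The names `CostCapExceeded` and `enforce_cost_cap` are hypothetical, and LLM-judge scoring overhead is not included.

```python
class CostCapExceeded(RuntimeError):
    """Raised when the projected run cost is above the configured cap."""


def projected_cost_usd(n_cases: int, cost_per_case: dict[str, float]) -> float:
    # Every judge sees every case, so the total is cases x sum of per-case costs.
    return n_cases * sum(cost_per_case.values())


def enforce_cost_cap(n_cases: int, cost_per_case: dict[str, float],
                     max_cost_usd: float = 5.0) -> float:
    projected = projected_cost_usd(n_cases, cost_per_case)
    if projected > max_cost_usd:
        raise CostCapExceeded(
            f"projected ${projected:.2f} exceeds cap ${max_cost_usd:.2f}"
        )
    return projected


# Per-case generation costs taken from the production decision table.
costs = {"grok-4.3": 0.0021, "opus-4.7": 0.0089, "sonnet-4.6": 0.0034}
projected = enforce_cost_cap(93, costs)   # 93 x 0.0144, well under the $5 default
```

Projecting from a static price table keeps the check free: the run aborts before any tokens are spent.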

Reproducibility — quickstart for a forker

# 1. Clone + bootstrap
git clone <your-fork>/eval-framework
cd eval-framework
python3.11 -m venv venv
./venv/bin/pip install -e .       # installs `eval-framework` CLI

# 2. Provider keys
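mkdir -p ~/.config/eval-framework
cat > ~/.config/eval-framework/.env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
XAI_API_KEY=xai-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF
chmod 600 ~/.config/eval-framework/.env   # keys readable by the owner only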

# 3. Postgres audit log (shared with Personal-RAG OK)
psql -d ragkb -f schema/eval_runs.sql

# 4. Bring your own task
cp tasks/knowledge_audit.yaml tasks/my_task.yaml
$EDITOR tasks/my_task.yaml         # set system_prompt + rubric + judge defaults

# 5. Bring your own eval set (start with 30 cases hand-curated)
mkdir -p eval/my_task
cp eval/knowledge_audit/dev-93.yaml eval/my_task/dev-30.yaml
$EDITOR eval/my_task/dev-30.yaml

# 6. Smoke test (1 case per judge)
eval-framework verify --task my_task --judges grok-4.3,claude-haiku-4.5 --smoke

# 7. Bake-off
eval-framework bake-off --task my_task \
  --judges grok-4.3,claude-haiku-4.5,claude-sonnet-4.6,gpt-5.4-mini \
  --eval-set dev-30 --scorer llm_judge

# 8. After iteration, freeze a holdout
mv eval/my_task/holdout-candidates.yaml eval/my_task/holdout-99.yaml

# 9. One-pass final decision
eval-framework bake-off --task my_task --judges <winner> \
  --eval-set holdout-99 --final-decision

# 10. Install drift cron
cp launchd/ai.eval-framework.drift-check.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.eval-framework.drift-check.plist

Total: 1-2 hours to first verified production decision once Postgres is ready.

Future work

Wire up the local MLX adapter: retry the distillation run on a weekend with clean GPU state; if it scores 70%+, promote it to tier-1 daily judge (cost-free)
4-tier triage routing in production: daily=4B+Haiku verifier, weekly=8B+Haiku, monthly=32B alone, on-demand=Grok 4.3
Promptfoo CI integration — deferred unless GitHub Actions wrap becomes worthwhile
Multi-task bake-off — single command runs the same judge list across all 8 portfolio tasks → portfolio-wide cost-quality frontier
Self-consistency / adaptive escalation — designed (Decision #4 in notes) but bad ROI at current volume; revisit if production volume ≥10× current
Hand-curated eval cases — replace Haiku-bootstrapped cases gradually with hand-curated ones for higher signal density

License & attribution

Personal project. Built on:
