● Shipped P0 Size M Foundation

Eval-Framework — Implementation

Tech stack deep-dive: code structure, schema, scoring rubric, statistics, performance, reproducibility.

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

A production-grade personal eval framework in continuous use across the side-project portfolio:

  • 7-model bake-off shipped (Claude Haiku 4.5, Sonnet 4.6, Opus 4.7, Grok 4-fast-reasoning, Grok 4.3, GPT-5.4-mini, Gemini 2.5 Flash)
  • Stratified 192-case eval set (5 buckets × 3 languages); 93 dev + 99 holdout (FROZEN)
  • LLM-as-judge scorer (Grok 4.3 default) — replaces strict substring on multi-valid-output tasks; verified +19.2pp accuracy gap on holdout-99
  • Production judge picked: Grok 4.3 wins Knowledge-Audit at 99.0% on holdout-99, $0.61/mo at audit volume
  • Statistics: bootstrap CI 1000 resamples (95%) + pairwise Cohen’s κ for correlated-bias detection
  • CLI: eval-framework bake-off | score | verify + Python API for notebooks
  • F1+F2+F3 shipped in a 10-hour single-day sprint (2026-05-21)
  • Applied to 2 projects so far: Knowledge-Audit (production) + Personal-RAG (Hit@3=97.8%, MRR=0.948 verified 2026-05-21); 6 more queued
  • Drift monitor: weekly launchd cron + Telegram P0 alert on ≥5pp regression
  • Audit log: Postgres eval_runs table shared with Personal-RAG infra

Stack

| Layer | Component | Version / Notes |
| --- | --- | --- |
| Runtime | Python | 3.11 + venv |
| CLI | click | sub-commands: bake-off, score, verify, inspect |
| Config schema | Pydantic | v2; YAML loader via pyyaml |
| Async runtime | asyncio + anyio | semaphore-bounded concurrency=8 |
| Provider SDKs | anthropic · xai-sdk · openai · google-genai | official clients per provider |
| Statistics | numpy + scikit-learn | bootstrap resample; cohen_kappa_score |
| Audit log | Postgres 16 | shared with Personal-RAG; table eval_runs |
| Scheduler | launchd | ai.eval-framework.drift-check.plist (weekly Sat 03:00) |
| Alerting | Telegram Bot API | eval-framework-bot |
| Test runner | pytest | parametrized over eval cases for ad-hoc inspection |

Directory layout

Repo (~/Documents/Side.Projects/eval-framework/)

src/eval_framework/
├── cli.py                       # click entrypoints
├── runner.py                    # asyncio fan-out + retry + holdout guard
├── config.py                    # Pydantic schemas (TaskConfig, EvalCase, ...)
├── adapters/
│   ├── base.py                  # JudgeAdapter protocol + CompletionResult
│   ├── anthropic_adapter.py     # ~35 LOC
│   ├── xai_adapter.py           # ~30 LOC
│   ├── openai_adapter.py        # ~30 LOC
│   ├── gemini_adapter.py        # ~40 LOC (handles empty candidates)
│   └── mlx_adapter.py           # config-shaped stub, deferred wire-up
├── scoring/
│   ├── strict.py                # substring match
│   ├── llm_judge.py             # judge call + verdict cache
│   └── rubrics.py               # built-in rubric templates
├── stats/
│   ├── bootstrap.py             # 1000-resample CI
│   ├── kappa.py                 # pairwise Cohen's κ matrix
│   └── stratify.py              # per-(bucket, lang) aggregation
├── reporter.py                  # rich-table render + Postgres write
├── prices.py                    # static $/Mtok table per (provider, model)
└── drift.py                     # weekly verify cron entry

tasks/
├── knowledge_audit.yaml         # production task
├── personal_rag_retrieval.yaml  # F4 application
├── mail_classify.yaml           # queued
└── voice_tool_use.yaml          # queued

eval/
├── knowledge_audit/
│   ├── dev-93.yaml              # iterable
│   ├── holdout-99.yaml          # FROZEN
│   ├── holdout-30-sample.yaml   # weekly drift subset
│   └── holdout-runs.log         # tamper-evident
└── personal_rag/
    └── personal-93.yaml         # 93-query held-out personal eval

schema/
└── eval_runs.sql

launchd/
└── ai.eval-framework.drift-check.plist

Schema

CREATE TABLE eval_runs (
    id              BIGSERIAL PRIMARY KEY,
    task_name       TEXT NOT NULL,
    eval_set_name   TEXT NOT NULL,             -- 'dev-93' | 'holdout-99' | ...
    judge_model     TEXT NOT NULL,             -- 'grok-4.3' | 'claude-haiku-4.5' | ...
    scorer_mode     TEXT NOT NULL,             -- 'llm_judge' | 'strict_substring'
    judge_for_scoring TEXT,                    -- 'claude-sonnet-4-6' | 'grok-4.3' | NULL
    n_cases         INT NOT NULL,
    n_pass          INT NOT NULL,
    accuracy        NUMERIC(5,4) NOT NULL,     -- 0.0000 - 1.0000
    ci_low          NUMERIC(5,4),
    ci_high         NUMERIC(5,4),
    p95_latency_ms  INT,
    total_cost_usd  NUMERIC(8,4),
    per_stratum     JSONB,                     -- {"positive_explicit_VN": 0.88, ...}
    run_type        TEXT NOT NULL,             -- 'bake-off' | 'final-decision' | 'drift_check'
    git_sha         TEXT,
    started_at      TIMESTAMPTZ NOT NULL,
    finished_at     TIMESTAMPTZ NOT NULL,
    notes           TEXT
);
CREATE INDEX idx_eval_runs_task ON eval_runs(task_name, finished_at DESC);
CREATE INDEX idx_eval_runs_drift ON eval_runs(task_name, judge_model, run_type, finished_at DESC);

CREATE TABLE eval_kappa (
    eval_run_id_a   BIGINT REFERENCES eval_runs(id),
    eval_run_id_b   BIGINT REFERENCES eval_runs(id),
    kappa           NUMERIC(5,4) NOT NULL,
    PRIMARY KEY (eval_run_id_a, eval_run_id_b)
);

Why this shape:

  • One row per (judge × eval-set × run) = grain matches the reporter table
  • per_stratum JSONB → flexible without per-task schema churn
  • (task_name, judge_model, run_type) index → drift cron queries last-4-week median in milliseconds
  • eval_kappa separate table → avoids N² explosion in eval_runs
  • git_sha → reproduce any historical run
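
The drift check that this index serves (last-4-week median, alert on a drop of 5pp or more) might reduce to a comparison like the one below. This is a minimal sketch: `is_regression` and the in-memory history list are hypothetical, and the real cron presumably reads its baseline from eval_runs rather than from a Python list.

```python
from statistics import median

REGRESSION_THRESHOLD_PP = 5.0  # from the doc: alert on a drop of 5pp or more


def is_regression(recent_accuracies: list[float], current: float) -> bool:
    """Compare the current run against the median of recent drift runs.

    Accuracies are fractions in [0, 1], matching eval_runs.accuracy.
    Hypothetical helper: the real cron would fetch the last four weekly
    rows through the (task_name, judge_model, run_type) index.
    """
    if not recent_accuracies:
        return False  # no baseline yet, nothing to compare against
    baseline = median(recent_accuracies)
    drop_pp = (baseline - current) * 100
    return drop_pp >= REGRESSION_THRESHOLD_PP


history = [0.9667, 0.9333, 0.9667, 1.0]  # four prior weekly runs (made-up values)
print(is_regression(history, 0.9000))     # True: drop of about 6.7pp
print(is_regression(history, 0.9333))     # False: drop of about 3.3pp
```

Using the median rather than the latest run keeps a single noisy week from shifting the baseline, which matters on a 30-case subset where one case is worth about 3.3pp.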

Task config schema (Pydantic)

from typing import Any, Literal

from pydantic import BaseModel


class ScoringConfig(BaseModel):
    mode: Literal["llm_judge", "strict_substring"]
    judge_model: str | None = None                 # required if mode == llm_judge
    rubric: str | None = None                      # required if mode == llm_judge
    pass_rule: Literal[
        "exact_match",
        "any_substring",
        "positive_at_least_one_valid_or_empty_on_negative",
    ] = "exact_match"

class TaskConfig(BaseModel):
    name: str
    description: str
    system_prompt: str
    user_template: str                              # str.format-style
    scoring: ScoringConfig
    default_judges: list[str] = []
    max_tokens: int = 2048
    temperature: float = 0.0

class EvalCase(BaseModel):
    id: str
    inputs: dict[str, Any]                          # interpolated into user_template
    expected: dict[str, Any] | None = None
    stratum: dict[str, str]                         # {"bucket": "positive_subtle", "lang": "VN"}
    expected_type: Literal["positive", "negative"] = "positive"
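
The longest pass_rule name, positive_at_least_one_valid_or_empty_on_negative, is not spelled out in the schema. One plausible reading, inferred only from the name and from EvalCase.expected_type, is sketched below. The `passes` function and the shape of `verdicts` are assumptions, not the framework's confirmed logic.

```python
from typing import Literal


def passes(expected_type: Literal["positive", "negative"], verdicts: list[str]) -> bool:
    """One reading of 'positive_at_least_one_valid_or_empty_on_negative'.

    `verdicts` holds the judge's VALID/INVALID token for each finding the
    model emitted on a case. This interpretation is inferred from the rule's
    name, not confirmed by the source.
    """
    if expected_type == "positive":
        # A positive case passes when at least one finding is judged VALID.
        return any(v == "VALID" for v in verdicts)
    # A negative case passes only when the model emitted no findings at all.
    return len(verdicts) == 0


print(passes("positive", ["INVALID", "VALID"]))  # True
print(passes("positive", []))                    # False
print(passes("negative", []))                    # True
print(passes("negative", ["INVALID"]))           # False
```

Under this reading, a model that never reports anything passes every negative case and fails every positive one, which is why per-stratum reporting matters alongside aggregate accuracy.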

MCP / CLI surface

| Command | Args | Purpose |
| --- | --- | --- |
| eval-framework bake-off | --task --judges --eval-set --scorer | Multi-judge bake-off on dev set; renders table; writes audit row per judge |
| eval-framework bake-off --final-decision | + --eval-set holdout-* | One-pass holdout run; appends to holdout-runs.log |
| eval-framework score | --run-id --rescore | Re-score existing raw outputs with a different scorer (cache hit) |
| eval-framework verify | --task --judges --eval-set | Drift check; quiet stdout, exits non-zero on regression |
| eval-framework inspect | --run-id --show-failures | Per-case input + expected + actual + judge verdict |
| eval-framework kappa | --task --run-ids | Pairwise κ matrix across selected runs |
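
The holdout behaviour of bake-off (holdout-* sets run only with --final-decision) can be pictured as a small pre-flight check. The sketch below is illustrative: `check_holdout_guard` and `HoldoutGuardError` are hypothetical names, and the actual guard in runner.py may be structured differently.

```python
from fnmatch import fnmatchcase


class HoldoutGuardError(RuntimeError):
    """Raised when a frozen holdout set is requested without the flag."""


def check_holdout_guard(eval_set: str, final_decision: bool) -> None:
    # Any eval set named holdout-* is treated as frozen.
    if fnmatchcase(eval_set, "holdout-*") and not final_decision:
        raise HoldoutGuardError(
            f"{eval_set!r} is frozen; re-run with --final-decision to use it"
        )


check_holdout_guard("dev-93", final_decision=False)     # passes silently
check_holdout_guard("holdout-99", final_decision=True)  # passes silently
try:
    check_holdout_guard("holdout-99", final_decision=False)
except HoldoutGuardError as exc:
    print(exc)
```

Matching on the name prefix means a new holdout file is protected as soon as it is named, with no separate registry to keep in sync.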

Adapter implementation (illustrative)

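# adapters/xai_adapter.py
import time

# Import paths below are assumed from the directory layout above.
from eval_framework.adapters.base import CompletionResult
from eval_framework.prices import PRICES


class XaiAdapter: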
    provider = "xai"

    def __init__(self, model_id: str):
        from xai_sdk import AsyncClient
        self.model_id = model_id
        self._client = AsyncClient()
        self._price_in, self._price_out = PRICES[("xai", model_id)]

    async def complete(self, system_prompt, user_message, max_tokens=2048, temperature=0.0):
        t0 = time.perf_counter()
        resp = await self._client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=max_tokens,
            temperature=temperature,
        )
        latency_ms = int((time.perf_counter() - t0) * 1000)
        text = resp.choices[0].message.content or ""
        in_tok = resp.usage.prompt_tokens
        out_tok = resp.usage.completion_tokens
        cost = in_tok / 1e6 * self._price_in + out_tok / 1e6 * self._price_out
        return CompletionResult(text, in_tok, out_tok, latency_ms, cost)

All 4 adapters follow this shape. Gemini adapter additionally guards resp.candidates (empty array = real failure mode observed during F3).
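
A guard for the empty-candidates case might look like the sketch below. It runs against stand-in objects rather than the real SDK; the attribute path `candidates[0].content.parts[...].text` is an assumption about the response shape, and `EmptyCandidatesError` and `extract_text` are hypothetical names.

```python
from types import SimpleNamespace


class EmptyCandidatesError(RuntimeError):
    """Hypothetical error type for a response with nothing to score."""


def extract_text(resp) -> str:
    # Assumed shape: resp.candidates[i].content.parts[j].text
    candidates = getattr(resp, "candidates", None) or []
    if not candidates:
        raise EmptyCandidatesError("response has no candidates")
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None) or []
    return "".join(getattr(p, "text", None) or "" for p in parts)


ok = SimpleNamespace(candidates=[SimpleNamespace(
    content=SimpleNamespace(parts=[SimpleNamespace(text="VALID")]))])
empty = SimpleNamespace(candidates=[])

print(extract_text(ok))  # VALID
try:
    extract_text(empty)
except EmptyCandidatesError as exc:
    print(f"flagged: {exc}")
```

Raising a distinct error, instead of returning an empty string, lets the report separate an integration failure from a model that genuinely answered wrong.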

Scoring rubric (LLM-judge)

The Knowledge-Audit task ships with this rubric (Sonnet 4.6 verdict prompt):

You are evaluating an auditor's finding against source documents.

A finding is VALID if and only if:
  - It describes a real contradiction between two sources
  - The quoted text appears verbatim in the cited source
  - It is not a paraphrase, summary, or over-inference

A finding is INVALID if:
  - The quoted text is not verbatim
  - The two sources actually agree (paraphrase, not contradiction)
  - The finding is speculative or adds claims not in sources

Sources:
{sources}

Finding:
{finding_json}

Output exactly one token: VALID or INVALID

Verdicts cached by (sha256(finding_json), case_id) → re-runs across prompt iteration loops are free.
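
A minimal sketch of such a cache follows, assuming an in-memory dict as the store (the source does not say where verdicts are persisted). `VerdictCache` and `fake_judge` are hypothetical names, and the canonical-JSON step is an added assumption so that key order does not change the hash.

```python
import hashlib
import json


class VerdictCache:
    """In-memory stand-in; the real cache's storage is not described."""

    def __init__(self) -> None:
        self._store = {}  # (sha256 hex digest, case_id) -> verdict
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(finding: dict, case_id: str) -> tuple:
        # Canonical JSON so that key order does not change the hash.
        finding_json = json.dumps(finding, sort_keys=True, ensure_ascii=False)
        digest = hashlib.sha256(finding_json.encode("utf-8")).hexdigest()
        return digest, case_id

    def get_or_judge(self, finding: dict, case_id: str, judge) -> str:
        k = self.key(finding, case_id)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        verdict = judge(finding)  # the expensive LLM call in practice
        self._store[k] = verdict
        return verdict


def fake_judge(finding: dict) -> str:
    return "VALID"  # placeholder for the real judge call


cache = VerdictCache()
finding = {"quote": "example quote", "source": "doc-a"}
reordered = {"source": "doc-a", "quote": "example quote"}

cache.get_or_judge(finding, "case-001", fake_judge)
cache.get_or_judge(reordered, "case-001", fake_judge)
print(cache.hits, cache.misses)  # 1 1
```

As described, the key covers only the finding and the case. If the rubric text or the verdict model changes, cached verdicts would presumably need to be invalidated by hand.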

Statistics

Bootstrap CI

import numpy as np


def bootstrap_ci(pass_vec: list[bool], n_resamples: int = 1000, alpha: float = 0.05):
    arr = np.array(pass_vec, dtype=int)
    n = len(arr)
    accs = [arr[np.random.randint(0, n, size=n)].mean() for _ in range(n_resamples)]
    return np.percentile(accs, [alpha / 2 * 100, (1 - alpha / 2) * 100])
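
To see what the interval looks like near the ceiling, here is a seeded standard-library variant of the same percentile bootstrap. It is a sketch, not the framework's code: `bootstrap_ci_stdlib` is a hypothetical name, the nearest-rank percentile is a simplification of `np.percentile`, and the pass vector is constructed to have 98 passes out of 99 rather than taken from a real run.

```python
import random


def bootstrap_ci_stdlib(pass_vec, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap with a fixed seed so repeated calls agree."""
    rng = random.Random(seed)
    n = len(pass_vec)
    # Resample n cases with replacement and record each resample's accuracy.
    accs = sorted(sum(rng.choices(pass_vec, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return accs[lo_idx], accs[hi_idx]  # nearest-rank percentiles


pass_vec = [True] * 98 + [False]  # constructed example: 98 of 99 passing
low, high = bootstrap_ci_stdlib(pass_vec)
print(f"accuracy={sum(pass_vec) / len(pass_vec):.3f} CI=({low:.3f}, {high:.3f})")
```

With a single failure in the vector, many resamples contain no failure at all, so the upper bound sits at 1.0. Intervals this close to the ceiling are better read as "at most a few failures" than as a symmetric error bar.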

Pairwise Cohen’s κ

import math

from sklearn.metrics import cohen_kappa_score


def kappa_matrix(pass_vectors: dict[str, list[bool]]) -> dict[tuple[str, str], float]:
    judges = list(pass_vectors)
    out = {}
    for i, a in enumerate(judges):
        for b in judges[i + 1:]:
            k = cohen_kappa_score(pass_vectors[a], pass_vectors[b])
            out[(a, b)] = 1.0 if math.isnan(k) else k     # NaN floor: perfect agreement
    return out

Observed κ values from the F3 bake-off:

| Pair | κ | Note |
| --- | --- | --- |
| Sonnet ↔ Opus | 0.66 | Same Anthropic family → correlated bias |
| Grok ↔ Sonnet | 0.52 | Independent providers |
| Grok ↔ OpenAI | 0.55 | Independent |
| Haiku ↔ Grok | −0.40 | Anti-correlated: opposite errors |

Ensembles avoid same-family pairs. This insight came from κ; aggregate accuracy would not have revealed it.
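
A toy calculation shows why accuracy alone hides this. The three judges below are invented and all score 8/10, yet their pairwise κ values differ sharply depending on which cases they fail. The hand-rolled `cohen_kappa` is a plain binary implementation for illustration, standing in for scikit-learn's `cohen_kappa_score`.

```python
def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary pass vectors of equal length."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each judge's marginal pass rate.
    p_expected = pa * pb + (1 - pa) * (1 - pb)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both vectors constant and identical
    return (p_observed - p_expected) / (1 - p_expected)


def judge_failing(failed_cases: set[int], n: int = 10) -> list[bool]:
    return [i not in failed_cases for i in range(n)]


j1 = judge_failing({0, 1})
j2 = judge_failing({0, 2})  # shares one failure with j1
j3 = judge_failing({2, 3})  # shares no failure with j1

print(sum(j1), sum(j2), sum(j3))       # 8 8 8
print(round(cohen_kappa(j1, j1), 3))   # 1.0
print(round(cohen_kappa(j1, j2), 3))   # 0.375
print(round(cohen_kappa(j1, j3), 3))   # -0.25
```

Judges with overlapping failures add little to an ensemble because they tend to be wrong together, while a low or negative κ suggests the pair's errors can offset each other under voting.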

Bake-off sequence (Knowledge-Audit, F3 actual run)

sequenceDiagram
    autonumber
    participant Op as Operator
    participant CLI as eval-framework CLI
    participant R as Runner
    participant A1 as Anthropic Adapter
    participant A2 as xAI Adapter
    participant A3 as OpenAI Adapter
    participant A4 as Gemini Adapter
    participant S as Scorer (LLM-judge)
    participant J as Sonnet 4.6 (verdict)
    participant DB as Postgres

    Op->>CLI: bake-off --task knowledge_audit<br/>--judges grok-4.3,haiku-4.5,sonnet-4.6,opus-4.7,gpt-5.4-mini,gemini-flash<br/>--eval-set dev-93
    CLI->>R: load task + eval set (93 cases)
    R->>R: holdout guard: 'dev-93' OK
    par parallel fan-out (sem=8)
        R->>A1: complete × (3 models × 93 cases)
        R->>A2: complete × (2 models × 93 cases)
        R->>A3: complete × (1 model × 93 cases)
        R->>A4: complete × (1 model × 93 cases)
    end
    A1-->>R: raw outputs + tokens + cost
    A2-->>R: raw outputs + tokens + cost
    A3-->>R: raw outputs + tokens + cost
    A4-->>R: empty candidates × 93 ⚠
    R->>S: score(all raw outputs)
    loop per finding
        S->>J: VALID or INVALID verdict
        J-->>S: VALID/INVALID
        Note over S,J: verdict cached by (finding_hash, case_id)
    end
    S-->>R: pass_vector per judge
    R->>R: bootstrap CI (1000 resamples)
    R->>R: pairwise Cohen's κ
    R->>R: per-stratum aggregation
    R->>DB: INSERT eval_runs × 7 + eval_kappa × C(7,2)
    R-->>Op: rich table + κ matrix + cost summary
    Op->>CLI: bake-off --judges grok-4.3<br/>--eval-set holdout-99 --final-decision
    Note over CLI,Op: Guard PASSES (flag present)
    CLI->>R: same flow, single judge
    R->>DB: INSERT eval_runs (run_type='final-decision')
    R->>R: append (sha, judge, run_id) to holdout-runs.log
    R-->>Op: 99.0% LLM-judge / 79.8% strict / $0.61/mo
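
The parallel fan-out step (sem=8) can be sketched with an asyncio.Semaphore as below. `fan_out` and `fake_complete` are hypothetical stand-ins for the runner and an adapter's complete(); the real runner.py also handles retries and the holdout guard, which are left out here.

```python
import asyncio


async def fan_out(jobs, worker, concurrency: int = 8):
    """Run worker(job) for every job with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(job):
        async with sem:
            return await worker(job)

    # return_exceptions=True keeps one failing judge from cancelling the rest.
    return await asyncio.gather(*(bounded(j) for j in jobs), return_exceptions=True)


async def demo():
    in_flight = 0
    peak = 0

    async def fake_complete(job):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for a provider call
        in_flight -= 1
        return f"{job[0]}:{job[1]}"

    jobs = [(judge, case) for judge in ("judge-a", "judge-b") for case in range(20)]
    results = await fan_out(jobs, fake_complete)
    return len(results), peak


n_results, peak_in_flight = asyncio.run(demo())
print(n_results, peak_in_flight)  # 40 8
```

Collecting exceptions as results, rather than letting the first one propagate, is one way to get the partial-run behaviour where the remaining judges still complete during a provider outage.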

Performance numbers

Measured on MacBook Pro M2 Max during F3 (2026-05-21):

| Operation | Number | Notes |
| --- | --- | --- |
| Bake-off wall time (6 judges × N=93 dev) | ~12 min | Parallel fan-out, semaphore=8 |
| Bake-off wall time (6 judges × N=30 dev) | ~4 min | Weekly drift baseline |
| Bake-off cost (6 judges × N=30 dev) | ~$3 | Anthropic + xAI + OpenAI; Gemini free tier |
| Holdout-99 single-judge cost | ~$0.50 | + ~$0.15 LLM-judge scoring |
| LLM-judge scoring overhead | ~30% on top of generation | Sonnet 4.6 verdict per finding |
| Verdict cache hit rate (re-scoring iteration) | >90% | After 2nd run on same eval set |
| Production cost: Grok 4.3 audit volume | $0.61/month | At personal Knowledge-Audit volume |
| Time to first verified production decision | 1 day | F1+F2+F3 + holdout confirmation |
| 192-case stratified set bootstrap | ~25 min wall | Haiku one-shot generation + hand-balance |

Retrieval quality measured via F4 application (Personal-RAG, 93-query held-out, 2026-05-21):

| Stage | Hit@1 | Hit@3 | MRR |
| --- | --- | --- | --- |
| bge-m3 only | 86.0% | 97.8% | 0.918 |
| bge-m3 + bge-reranker-v2-m3 | 89.2% | 97.8% | 0.948 |

The reranker swap was shipped as default because Eval-Framework measured a +3.2pp Hit@1 lift at negligible latency cost.
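
For reference, Hit@k and MRR as reported in the retrieval table can be computed from the rank of the first relevant result per query. The sketch below uses toy ranks, not the Personal-RAG data, and the function names are illustrative.

```python
from __future__ import annotations


def hit_at_k(ranks: list[int | None], k: int) -> float:
    """Fraction of queries whose first relevant result is within the top k.

    `ranks` holds the 1-based rank of the first relevant document per query,
    or None when nothing relevant was retrieved.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: list[int | None]) -> float:
    """Mean reciprocal rank; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


ranks = [1, 1, 2, None, 3, 1]  # toy data: six queries, one complete miss
print(hit_at_k(ranks, 1))          # 0.5
print(round(hit_at_k(ranks, 3), 3))  # 0.833
print(round(mrr(ranks), 3))        # 0.639
```

A reranker only reorders what the first stage retrieved, so it can raise Hit@1 and MRR while leaving Hit@3 unchanged whenever the relevant document was already in the top three. That matches the pattern in the table.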

Knowledge-Audit production decision (F3, 2026-05-21):

| Judge | Accuracy (dev-93) | Holdout-99 | Cost/case | p95 ms | Verdict |
| --- | --- | --- | --- | --- | --- |
| Grok 4.3 | 83.3% | 99.0% (LLM-judge) / 79.8% (strict) | $0.0021 | 1840 | PRODUCTION |
| Opus 4.7 | 66.7% | | $0.0089 | 2110 | too expensive |
| Sonnet 4.6 | 60.0% | | $0.0034 | 1320 | beat by Grok |
| GPT-5.4-mini | 53.3% | | $0.0011 | 980 | cheapest but accuracy gap |
| Grok 4-fast-reasoning | 50.0% | | $0.0008 | 720 | overlapping CI with GPT |
| Haiku 4.5 (as direct auditor) | 13.3% | abandoned | $0.0004 | 460 | conservative-bias collapse |
| Gemini 2.5 Flash | 0.0% (99/99 empty) | | $0 | 510 | API integration failure |

Reliability features

| Feature | How |
| --- | --- |
| Frozen-holdout guard | CLI refuses runs on holdout-* unless --final-decision is passed |
| Tamper log | Each --final-decision run appends to holdout-runs.log |
| Idempotent scoring | Verdict cache by (finding_hash, case_id) |
| Adapter retry | Provider-native retry × 3 with exp backoff |
| Provider sanity-check | eval-framework verify --smoke runs 1 case per judge before bake-off |
| Audit log | Every run writes eval_runs row + git SHA |
| Drift monitor | Weekly launchd cron + Telegram P0 on ≥5pp regression |
| Cohen’s κ NaN floor | Perfect-agreement degenerate case clamped at 1.0 with note |
| Bootstrap CI cap | Resamples capped at 1000 to bound wall time |
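
The adapter retry row relies on each SDK's built-in policy, so the framework does not need its own loop. For readers unfamiliar with the pattern, a generic version is sketched below. `with_retry` is hypothetical, and reading "× 3" as three total attempts is an assumption.

```python
import asyncio
import random


async def with_retry(call, attempts: int = 3, base_delay: float = 0.5):
    """Retry an async callable with exponential backoff and a little jitter.

    Generic illustration only; each provider SDK's native retry policy
    may differ in which errors it retries and how long it waits.
    """
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)


async def retry_demo():
    state = {"calls": 0}

    async def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient")
        return "ok"

    result = await with_retry(flaky, base_delay=0.01)
    return result, state["calls"]


print(asyncio.run(retry_demo()))  # ('ok', 3)
```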

Security model

| Threat | Mitigation |
| --- | --- |
| API key leak | Keys in ~/.config/eval-framework/.env, gitignored; never logged |
| Production data in eval cases | Cases auto-scrubbed by regex + PII LLM classifier at ingest; PR review gate |
| Holdout overfit | CLI guard + tamper log + loud overfit warning on repeat runs |
| Judge model bias | Pairwise κ surfaces correlated bias; report flags κ > 0.6 same-provider pairs |
| Cost runaway | Per-run cost cap (--max-cost-usd, default $5); CLI aborts if projected cost exceeds the cap |
| Drift false positive | Alert text includes run-id; operator confirms with full holdout-99 before acting |
| Provider outage during bake-off | Other judges still complete; report flags partial run |
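
The cost-runaway mitigation amounts to estimating spend before any provider call is made. A minimal sketch follows, reusing the per-token cost formula from the adapter. The function names, token estimates, and prices are placeholders, and how the CLI actually projects cost is not described in the source.

```python
class CostCapExceeded(RuntimeError):
    """Hypothetical error raised before any provider call is made."""


def projected_cost_usd(n_cases: int, est_in_tok: int, est_out_tok: int,
                       prices: dict) -> float:
    """Sum estimated cost over judges; prices are (input, output) in $/Mtok."""
    total = 0.0
    for price_in, price_out in prices.values():
        per_case = est_in_tok / 1e6 * price_in + est_out_tok / 1e6 * price_out
        total += per_case * n_cases
    return total


def enforce_cost_cap(projected: float, max_cost_usd: float = 5.0) -> None:
    if projected > max_cost_usd:
        raise CostCapExceeded(
            f"projected ${projected:.2f} exceeds cap ${max_cost_usd:.2f}"
        )


# Placeholder prices for illustration only, not real provider rates.
prices = {"judge-a": (3.0, 15.0), "judge-b": (0.2, 0.5)}
cost = projected_cost_usd(n_cases=93, est_in_tok=4000, est_out_tok=400, prices=prices)
print(f"{cost:.2f}")    # 1.77
enforce_cost_cap(cost)  # under the default $5 cap, so no error
```

Output length is the uncertain term in any projection, so a conservative estimate could use the task's max_tokens as `est_out_tok` to get an upper bound.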

Reproducibility — quickstart for a forker

# 1. Clone + bootstrap
git clone <your-fork>/eval-framework
cd eval-framework
python3.11 -m venv venv
./venv/bin/pip install -e .       # installs `eval-framework` CLI

# 2. Provider keys
mkdir -p ~/.config/eval-framework
cat > ~/.config/eval-framework/.env <<EOF
ANTHROPIC_API_KEY=sk-ant-...
XAI_API_KEY=xai-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF

# 3. Postgres audit log (shared with Personal-RAG OK)
psql -d ragkb -f schema/eval_runs.sql

# 4. Bring your own task
cp tasks/knowledge_audit.yaml tasks/my_task.yaml
$EDITOR tasks/my_task.yaml         # set system_prompt + rubric + judge defaults

# 5. Bring your own eval set (start with 30 cases hand-curated)
mkdir -p eval/my_task
cp eval/knowledge_audit/dev-93.yaml eval/my_task/dev-30.yaml
$EDITOR eval/my_task/dev-30.yaml

# 6. Smoke test (1 case per judge)
eval-framework verify --task my_task --judges grok-4.3,claude-haiku-4.5 --smoke

# 7. Bake-off
eval-framework bake-off --task my_task \
  --judges grok-4.3,claude-haiku-4.5,claude-sonnet-4.6,gpt-5.4-mini \
  --eval-set dev-30 --scorer llm_judge

# 8. After iteration, freeze a holdout
mv eval/my_task/holdout-candidates.yaml eval/my_task/holdout-99.yaml

# 9. One-pass final decision
eval-framework bake-off --task my_task --judges <winner> \
  --eval-set holdout-99 --final-decision

# 10. Install drift cron
cp launchd/ai.eval-framework.drift-check.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.eval-framework.drift-check.plist

Total: 1-2 hours to first verified production decision once Postgres is ready.

Future work

  • Local MLX adapter wired — retry distill on weekend with clean GPU state; if 70%+ → tier-1 daily judge (cost-free)
  • 4-tier triage routing in production: daily=4B+Haiku verifier, weekly=8B+Haiku, monthly=32B alone, on-demand=Grok 4.3
  • Promptfoo CI integration — deferred unless GitHub Actions wrap becomes worthwhile
  • Multi-task bake-off — single command runs the same judge list across all 8 portfolio tasks → portfolio-wide cost-quality frontier
  • Self-consistency / adaptive escalation — designed (Decision #4 in notes) but bad ROI at current volume; revisit if production volume ≥10× current
  • Hand-curated eval cases — replace Haiku-bootstrapped cases gradually with hand-curated ones for higher signal density

License & attribution

Personal project. Built on: