Architecture

Sister docs: PRD (intent), Implementation (code deep-dive), Notes (decision log).

System view

flowchart TB
    classDef user fill:#cce0e8,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef core fill:#faedd6,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef adapter fill:#e0d5ed,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef store fill:#f4d6db,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px

    subgraph CLI["👤 Operator (PM)"]
        Cmd["eval-framework bake-off
--task T --judges J1,J2,J3
--eval-set dev-93 --scorer llm_judge"]
        Notebook["Python API
(jupyter notebooks)"]
    end

    Cmd --> Runner
    Notebook --> Runner

    subgraph Core["🏗️ Eval-Framework core"]
        Runner["Runner
(asyncio parallel fan-out + retry)"]
        Config["Task config
(YAML + Pydantic schema)"]
        Scorer["Scorer
llm_judge (default) /
strict_substring"]
        Stats["Stats
bootstrap CI 1000x
Cohen's κ pairwise"]
        Reporter["Reporter
(per-stratum table,
cost, p95 latency)"]
        Runner --> Config
        Runner --> Scorer
        Scorer --> Stats
        Stats --> Reporter
    end

    subgraph Adapters["🔌 Judge adapter layer"]
        AnthAd["Anthropic adapter
(Haiku/Sonnet/Opus 4.x)"]
        XaiAd["xAI adapter
(Grok 4-fast, Grok 4.3)"]
        OaiAd["OpenAI adapter
(GPT-5.4-mini)"]
        GeminiAd["Gemini adapter
(Flash 2.5)"]
        MlxAd["Local MLX adapter
(config-shaped, F7 wired)"]
    end

    Runner -->|parallel| AnthAd
    Runner -->|parallel| XaiAd
    Runner -->|parallel| OaiAd
    Runner -->|parallel| GeminiAd
    Runner -.->|deferred| MlxAd

    Scorer --> JudgeLLM["LLM-Judge
(Sonnet 4.6 default)
VALID / INVALID verdict"]

    subgraph Stores["🗄️ Eval stores"]
        Tasks["tasks/
knowledge_audit.yaml
mail_classify.yaml
..."]
        EvalDev["eval/dev-93.yaml
(iterable)"]
        EvalHold["eval/holdout-99.yaml
FROZEN — one-pass only"]
        PG["Postgres 16
eval_runs audit log
(shared w/ Personal-RAG)"]
    end

    Config -.reads.-> Tasks
    Runner -.reads.-> EvalDev
    Runner -.reads.-> EvalHold
    Reporter -.writes.-> PG

    subgraph Monitor["📡 Drift monitor (launchd weekly)"]
        DriftCron["eval-framework verify
--task T --eval-set holdout-30-sample"]
        Telegram["Telegram bot
P0 alert if accuracy drops ≥ 5pp"]
        DriftCron --> Telegram
    end

    DriftCron -.uses.-> Adapters

    class Cmd,Notebook user
    class Runner,Config,Scorer,Stats,Reporter core
    class AnthAd,XaiAd,OaiAd,GeminiAd,MlxAd,JudgeLLM adapter
    class Tasks,EvalDev,EvalHold,PG store

The 5-component methodology

The framework is opinionated around 5 components discovered during the F3 bake-off. Each is necessary; dropping any one degrades decision quality:

#	Component	Purpose	Failure mode if dropped
1	Stratified eval set	5 buckets × 3 languages = 15 strata; reveals per-segment bias	Aggregate accuracy hides catastrophic failure modes (e.g. Gemini 99/99 empty)
2	Single task × multi judges	Apples-to-apples comparison with same SYSTEM_PROMPT + user format	Confounded comparisons; can’t attribute delta to model vs prompt
3	LLM-as-judge scorer (for multi-valid-output)	Sonnet 4.6 returns a VALID/INVALID verdict per finding	Strict substring undercounts (79.8% strict vs 99.0% LLM-judge on the same outputs); kills good models
4	Pairwise Cohen’s κ	Detects correlated bias across same-family judges	Ensembles built on Sonnet+Opus = duplicate errors, false confidence
5	Frozen holdout	Held-out queries never used during tuning; final decision = 1 pass	Overfit to dev → production regression at first real query
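As a minimal sketch of why component 1 matters, the snippet below groups pass/fail results by (bucket, lang) stratum and prints per-stratum accuracy next to the aggregate. The `Result` type and the toy data are hypothetical, invented for illustration; they are not the framework's own types or real bake-off numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    bucket: str
    lang: str
    passed: bool


def per_stratum_accuracy(results):
    """Return {(bucket, lang): accuracy} for a list of Result objects."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.bucket, r.lang)].append(r.passed)
    return {stratum: sum(v) / len(v) for stratum, v in groups.items()}


# Toy data (made up): strong on EN, weak on VN.
results = (
    [Result("positive_explicit", "EN", True)] * 4
    + [Result("positive_explicit", "VN", False)] * 3
    + [Result("positive_explicit", "VN", True)] * 1
)

aggregate = sum(r.passed for r in results) / len(results)
by_stratum = per_stratum_accuracy(results)

print(f"aggregate: {aggregate:.3f}")
for stratum, acc in sorted(by_stratum.items()):
    print(stratum, f"{acc:.2f}")
```

The aggregate sits between the two strata and hides that one language is failing three times out of four, which is the kind of bias a per-stratum table is meant to surface.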

Pipeline

[0] Author defines task YAML:
    tasks/knowledge_audit.yaml
      name: knowledge_audit
      system_prompt: |
        You are an auditor. Extract claims from <source>...
        Compare across sources. Output JSON list of contradictions or [].
      user_template: "Sources:\n{sources}\n\nFindings (JSON):"
      scoring:
        mode: llm_judge
        judge_model: claude-sonnet-4-6
        rubric: |
          For each finding, output VALID iff it is a real contradiction
          (not paraphrase, not over-inference, quote is verbatim).
        pass_rule: positive_at_least_one_valid_or_empty_on_negative

           │
           ▼
[1] CLI parses:
    eval-framework bake-off --task knowledge_audit
                            --judges grok-4.3,claude-haiku-4.5,claude-opus-4.7,...
                            --eval-set dev-93
                            --scorer llm_judge

           │
           ▼
[2] Runner loads eval set (dev-93.yaml) → list[EvalCase]
    Each case = {id, sources, expected_findings, stratum:{bucket, lang}}

           │
           ▼
[3] Parallel fan-out (asyncio.gather, semaphore concurrency=8):
    for judge in judges:
      for case in cases:
        adapter.complete(system_prompt, user_template.format(sources=case.sources))
        → raw_output

    Retry on transient errors (429, 502, timeout): exp backoff 3 attempts.
    Capture: raw_output, latency_ms, input_tokens, output_tokens, cost.

           │
           ▼
[4] Scorer (mode=llm_judge):
    For each (judge, case):
      Parse raw_output → list[finding]
      For each finding:
        Sonnet 4.6 verdict(finding, case.sources) → VALID | INVALID
      pass = pass_rule(verdicts, case.expected_type)

    Caches Sonnet verdicts by (finding_hash, case_id) → re-runs free.

           │
           ▼
[5] Stats:
    For each judge:
      acc = sum(pass) / N
      ci_95 = bootstrap_resample(pass_vector, 1000)
      per_stratum_acc = group_by(stratum) → acc per (bucket, lang)
      cost_per_case = total_cost / N
      p95_latency = percentile(latencies, 95)

    Cohen's κ pairwise:
      For each (j1, j2):
        κ = cohen_kappa_score(pass_j1, pass_j2)

           │
           ▼
[6] Reporter renders to stdout + writes to Postgres:
    ┌────────────────┬──────┬──────────┬──────────┬─────────┬──────┐
    │ Judge          │ Acc  │ CI 95%   │ $/case   │ p95 ms  │ Rank │
    ├────────────────┼──────┼──────────┼──────────┼─────────┼──────┤
    │ Grok 4.3       │ 83.3%│ [76, 89] │ $0.0021  │ 1840    │  1   │
    │ Opus 4.7       │ 66.7%│ [59, 74] │ $0.0089  │ 2110    │  2   │
    │ Sonnet 4.6     │ 60.0%│ [52, 68] │ $0.0034  │ 1320    │  3   │
    │ GPT-5.4-mini   │ 53.3%│ [45, 61] │ $0.0011  │ 980     │  4   │
    │ Grok 4-fast    │ 50.0%│ [42, 58] │ $0.0008  │ 720     │  5   │
    │ Haiku 4.5      │ 13.3%│ [ 8, 20] │ $0.0004  │ 460     │  6   │
    │ Gemini 2.5 Fl. │ 0.0% │ [ 0,  4] │ $0       │ 510     │  7 ⚠ │
    └────────────────┴──────┴──────────┴──────────┴─────────┴──────┘

    Per-stratum + κ matrix printed below.
    eval_runs.id = 4271 (Postgres audit log).

           │
           ▼
[7] (Operator) Inspect top 2-3 → re-run on holdout-99 ONCE:
    eval-framework bake-off --task knowledge_audit
                            --judges grok-4.3
                            --eval-set holdout-99
                            --final-decision

    → 99.0% LLM-judge / 79.8% strict / $0.61/mo projected production cost
    → Commit decision in project notes; deploy.
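Step [3] of the pipeline above (semaphore-bounded fan-out with retry) could look roughly like the following. This is a sketch under assumptions, not the framework's runner: `TransientError`, `with_retry`, `fan_out`, and `fake_complete` are hypothetical names, and the fake adapter only simulates a single rate-limit failure per call.

```python
import asyncio
import random


class TransientError(Exception):
    """Stand-in for a 429 / 502 / timeout from a provider (hypothetical)."""


async def with_retry(call, attempts=3, base_delay=0.01):
    """Await call(), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await call()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Delay doubles each attempt; a little jitter avoids retry bursts.
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


async def fan_out(judges, cases, complete, concurrency=8):
    """Call complete(judge, case) for every pair, at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(judge, case):
        async with sem:
            try:
                out = await with_retry(lambda: complete(judge, case))
                return (judge, case, out)
            except TransientError:
                return (judge, case, None)  # all retries failed: record as error

    return await asyncio.gather(*(one(j, c) for j in judges for c in cases))


# Fake adapter call: fails once per (judge, case), then succeeds.
_seen = set()


async def fake_complete(judge, case):
    if (judge, case) not in _seen:
        _seen.add((judge, case))
        raise TransientError("simulated 429")
    return f"{judge}:{case}"


results = asyncio.run(fan_out(["j1", "j2"], ["c1", "c2", "c3"], fake_complete))
print(len(results), results[0])
```

`asyncio.gather` preserves input order, so results line up with the (judge, case) pairs even though calls complete out of order.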

Judge adapter pattern

All adapters implement a thin JudgeAdapter protocol:

from typing import Protocol

# CompletionResult is the framework's result type (defined elsewhere).
class JudgeAdapter(Protocol):
    model_id: str
    provider: str

    async def complete(
        self,
        system_prompt: str,
        user_message: str,
        max_tokens: int = 2048,
        temperature: float = 0.0,
    ) -> CompletionResult:
        """Return a CompletionResult carrying text, input_tokens, output_tokens, latency_ms, cost_usd."""

Adding a new judge = ~30 LOC: import provider SDK, map kwargs, compute cost from token counts via a static price table. No core framework changes.
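The "static price table" step might be as small as the sketch below. The table name, the model key, and the prices are placeholders chosen for easy arithmetic; they are not real provider prices and not the framework's actual table.

```python
# Placeholder prices in USD per million tokens: (input, output).
# These numbers are illustrative only, NOT real provider pricing.
PRICE_TABLE = {
    "example-model": (1.00, 5.00),
}


def compute_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD = in_tokens x $/Mtok_in + out_tokens x $/Mtok_out."""
    price_in, price_out = PRICE_TABLE[model_id]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


cost = compute_cost("example-model", input_tokens=2_000, output_tokens=400)
print(f"${cost:.4f}")  # $0.0040
```

Keeping prices in one table means a provider price change is a one-line edit that never touches adapter code.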

Provider quirks handled at adapter layer:

Anthropic — system is a top-level param, not a message
xAI — OpenAI-compatible Chat Completions; temperature=0 allowed
OpenAI GPT-5.4-mini — Responses API; reasoning.effort defaults handled
Gemini — system instruction via system_instruction field; empty candidates array is a real (frequent) failure mode — adapter returns CompletionResult(text="", ...) so scorer can flag

Scoring layer

Two modes, selected per task config:

`strict_substring`

def score(raw_output: str, expected_substr: str) -> bool:
    return expected_substr.lower() in raw_output.lower()

Use when: single canonical answer (e.g. “What’s the latency p95?” → "840 ms").

Fails on: multi-valid-output tasks.

F3 lesson: on Knowledge-Audit, strict scored Grok 4.3 at 79.8% (79/99 on holdout-99), but 19 of the 20 “failures” were valid alternate findings the rubric had simply not listed. Counting those as passes gives 98/99 = 99.0%.
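A short illustration of both behaviours, reusing the `score` function above. The example strings are invented for this sketch.

```python
def score(raw_output: str, expected_substr: str) -> bool:
    return expected_substr.lower() in raw_output.lower()


# Single canonical answer: substring matching works.
canonical_ok = score("The p95 latency is 840 ms.", "840 ms")
print(canonical_ok)  # True

# Multi-valid output: a correct finding that is worded differently from the
# listed expectation is scored as a failure.
expected = "launch date is 2024-03-01"
output = '[{"finding": "Source A says the launch is in March, source B says April"}]'
alternate_ok = score(output, expected)
print(alternate_ok)  # False
```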

`llm_judge` (default)

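import asyncio

async def score(raw_output: str, case: EvalCase, rubric: str, judge_model: str) -> bool:
    # Assumed to be defined elsewhere: `judge` (adapter for judge_model),
    # `rubric_system` (system prompt built from rubric), `fmt`, `parse_json_list`.
    findings = parse_json_list(raw_output)
    # asyncio.gather takes awaitables as positional arguments, so unpack the generator.
    verdicts = await asyncio.gather(
        *(judge.complete(rubric_system, fmt(finding, case)) for finding in findings)
    )
    return apply_pass_rule(verdicts, case.expected_type)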

Use when: multi-valid-output tasks (audit, summarization, translation, classification with overlapping labels).

Cost: ~30% on top of generation cost.

Cache: judge verdicts are cached by (finding_hash, case_id), so re-runs that produce identical findings incur no extra judge cost.
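The pass rule and the verdict cache can be sketched as follows. Two things here are assumptions: the semantics of `positive_at_least_one_valid_or_empty_on_negative` are inferred from its name, and the `"positive"` / `"negative"` values for `expected_type` are hypothetical. `VerdictCache` and `fake_judge` are illustrative stand-ins, not framework classes.

```python
import asyncio
import hashlib


def apply_pass_rule(verdicts, expected_type):
    """Assumed reading of positive_at_least_one_valid_or_empty_on_negative."""
    if expected_type == "positive":
        return any(v == "VALID" for v in verdicts)
    # Negative case: pass only if the model reported no findings at all.
    return len(verdicts) == 0


class VerdictCache:
    """In-memory cache keyed by (finding_hash, case_id)."""

    def __init__(self, judge_fn):
        self._judge_fn = judge_fn  # async (finding, case_id) -> "VALID" | "INVALID"
        self._store = {}
        self.judge_calls = 0

    async def verdict(self, finding: str, case_id: str) -> str:
        key = (hashlib.sha256(finding.encode("utf-8")).hexdigest(), case_id)
        if key not in self._store:
            self.judge_calls += 1
            self._store[key] = await self._judge_fn(finding, case_id)
        return self._store[key]


async def fake_judge(finding, case_id):
    # Stand-in for the scoring-judge call; real verdicts come from the LLM.
    return "VALID" if "contradict" in finding else "INVALID"


async def main():
    cache = VerdictCache(fake_judge)
    findings = ["A contradicts B on the date", "A restates B"]
    for _ in range(2):  # the second pass is served entirely from the cache
        verdicts = [await cache.verdict(f, "case-001") for f in findings]
    return verdicts, cache.judge_calls


verdicts, calls = asyncio.run(main())
print(verdicts, calls)
print(apply_pass_rule(verdicts, "positive"), apply_pass_rule([], "negative"))
```

Hashing the finding text means any change in the generated finding yields a new key, so only byte-identical findings are served from the cache.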

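Default judge model: Grok 4.3 (per the dojo eval; Haiku was wrongly set as the default twice because of SDK familiarity). Here "judge" means the model that performs the task (the --judges value); the scoring judge that issues VALID/INVALID verdicts is Sonnet 4.6 by default (judge_model in the task YAML). Config-overridable per task.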

Holdout protection

The CLI refuses to run on holdout-* eval sets unless the explicit --final-decision flag is set. This is the only mechanism that prevents “just one more iteration on holdout” overfit.

# runner.py
if eval_set_path.name.startswith("holdout-") and not args.final_decision:
    raise GuardError(
        f"Refusing to run on {eval_set_path.name} without --final-decision. "
        f"Use dev-* for iteration; holdout is one-pass only."
    )

After a --final-decision run, the holdout-set hash, judge list, and run ID are appended to a tamper-evident log (eval/holdout-runs.log). Re-running on the same holdout is allowed but flagged loudly in the report (overfit warning).
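The document does not say how the log is made tamper-evident. One common technique is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below shows that technique on an in-memory list; the field names and functions are hypothetical and this is not necessarily how eval/holdout-runs.log is implemented.

```python
import hashlib
import json

GENESIS = "0" * 64  # "previous hash" used for the first entry


def _digest(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, holdout_hash, judges, run_id):
    """Append an entry whose hash covers the previous entry's hash."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    body = {"holdout_hash": holdout_hash, "judges": judges, "run_id": run_id, "prev": prev}
    log.append({**body, "entry_hash": _digest(body)})


def verify(log):
    """Recompute the chain; an edited entry, or one deleted before the tail, breaks it."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev"] != prev or entry["entry_hash"] != _digest(body):
            return False
        prev = entry["entry_hash"]
    return True


def is_rerun(log, holdout_hash):
    """True if this holdout set already has a final-decision run on record."""
    return any(e["holdout_hash"] == holdout_hash for e in log)


log = []
append_entry(log, "abc123", ["grok-4.3"], 1)
print(is_rerun(log, "abc123"))      # True: a second run should carry the overfit warning
append_entry(log, "abc123", ["grok-4.3"], 2)
print(verify(log))                  # True
log[0]["judges"] = ["other-model"]  # simulate tampering
print(verify(log))                  # False
```

A plain chain cannot detect truncation of the most recent entries; catching that would need the latest hash stored somewhere else as well.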

Stratification

The 192-case bootstrap was generated by Haiku from a corpus sample, then hand-balanced to 5 buckets × 3 languages:

Bucket	Languages	Count
`positive_explicit` (clear contradiction)	VN / EN / mixed	13 each = 39
`positive_subtle` (semantic contradiction)	VN / EN / mixed	13 each = 39
`negative_paraphrase` (looks-like-contradiction, actually same)	VN / EN / mixed	13 each = 39
`negative_no_overlap` (sources don’t share topic)	VN / EN / mixed	13 each = 39
`edge_quote_mismatch` (quote attribution wrong)	VN / EN / mixed	12 each = 36
Total		192

Split: 93 dev + 99 holdout (per-stratum proportional).
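A per-stratum proportional split can be sketched as below. The case dictionaries and `stratified_split` are hypothetical; the bucket sizes come from the table above. Rounding within each stratum yields counts close to, but not necessarily equal to, 93/99, so the actual split presumably involved some adjustment that is not documented here.

```python
import random
from collections import defaultdict


def stratified_split(cases, dev_fraction, seed=0):
    """Split cases into (dev, holdout), keeping each stratum's share roughly proportional."""
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[(case["bucket"], case["lang"])].append(case)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    dev, holdout = [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_dev = round(len(group) * dev_fraction)
        dev.extend(group[:n_dev])
        holdout.extend(group[n_dev:])
    return dev, holdout


# Cases per language within each bucket, as in the stratification table.
buckets = {
    "positive_explicit": 13,
    "positive_subtle": 13,
    "negative_paraphrase": 13,
    "negative_no_overlap": 13,
    "edge_quote_mismatch": 12,
}
cases = [
    {"id": f"{bucket}-{lang}-{i}", "bucket": bucket, "lang": lang}
    for bucket, n in buckets.items()
    for lang in ("VN", "EN", "mixed")
    for i in range(n)
]

dev, holdout = stratified_split(cases, dev_fraction=93 / 192)
print(len(cases), len(dev), len(holdout))
```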

Drift monitor (weekly cron)

[launchd weekly Sat 03:00]
    │
    ▼
eval-framework verify --task knowledge_audit
                      --eval-set holdout-30-sample
                      --judges grok-4.3
    │
    ▼
Postgres: write eval_runs row (run_type='drift_check')
    │
    ▼
SQL: compare acc to last 4-week median
    │
    ├── Δacc ≤ -5pp → Telegram P0:
    │   "🔴 Knowledge-Audit Grok 4.3 regressed:
    │    98% → 91% (Δ -7pp). Run full holdout-99 to confirm.
    │    eval_runs.id=5023"
    │
    └── Δacc > -5pp → silent

The alert format is ADHD-friendly (per memory note feedback_adhd_delivery): severity-tagged, action verb first, no fluff.
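The threshold check in the flow above amounts to comparing the current accuracy with the median of recent runs and alerting on a drop of 5pp or more. A minimal sketch, assuming accuracies are stored as percentages; the function names and the history values are made up, and in the real flow the baseline comes from a SQL query over eval_runs.

```python
from statistics import median

THRESHOLD_PP = 5.0  # alert when accuracy drops by at least this many points


def drift_check(current_acc, recent_accs, threshold_pp=THRESHOLD_PP):
    """Return (delta_pp, should_alert) against the median of recent runs."""
    baseline = median(recent_accs)
    delta = current_acc - baseline
    return delta, delta <= -threshold_pp


def format_alert(task, judge, baseline, current, delta, run_id):
    return (
        f"🔴 {task} {judge} regressed: {baseline:.0f}% → {current:.0f}% "
        f"(Δ {delta:+.0f}pp). Run full holdout-99 to confirm. eval_runs.id={run_id}"
    )


history = [98.0, 98.0, 97.0, 99.0]  # last four weekly runs (made-up values)
delta, alert = drift_check(91.0, history)
print(delta, alert)  # -7.0 True
if alert:
    print(format_alert("Knowledge-Audit", "Grok 4.3", median(history), 91.0, delta, 5023))
```

Using the median rather than the last run keeps a single noisy week from shifting the baseline.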

Data flow — judge call

Runner (asyncio task per (judge, case))
        │
        ▼
Adapter.complete(system, user)
        │
        ├── SDK retry-with-backoff (3 attempts)
        ├── Capture: text, in_tokens, out_tokens, ms
        └── Cost = in_tokens × $/Mtok_in + out_tokens × $/Mtok_out
        │
        ▼
Scorer.score(text, case)
        │
        ├── parse → list[finding]
        ├── for finding: judge.complete(rubric_prompt, finding)
        │     ← cached by (finding_hash, case_id)
        └── apply_pass_rule(verdicts, case.expected_type) → bool
        │
        ▼
Stats accumulator:
    pass_vector[judge].append(bool)
    cost_total[judge] += cost
    latencies[judge].append(ms)
    per_stratum[judge][stratum].append(bool)
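Once the accumulator is filled, the bootstrap CI and p95 latency can be computed from `pass_vector` and `latencies`. The sketch below uses a percentile bootstrap and a nearest-rank percentile; both are assumptions about the method, since the document specifies only "bootstrap CI 1000x" and "p95". The sample data is invented.

```python
import math
import random


def bootstrap_ci(pass_vector, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy over booleans."""
    rng = random.Random(seed)
    n = len(pass_vector)
    # Resample n outcomes with replacement and record the accuracy each time.
    accs = sorted(sum(rng.choices(pass_vector, k=n)) / n for _ in range(n_resamples))
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


pass_vector = [True] * 25 + [False] * 5  # toy data: 25 of 30 passed
latencies = [120, 200, 340, 410, 900]    # toy latencies in ms

acc = sum(pass_vector) / len(pass_vector)
lo, hi = bootstrap_ci(pass_vector)
print(f"acc={acc:.3f} ci95=[{lo:.3f}, {hi:.3f}]")
print("p95 latency:", percentile(latencies, 95))
```

With only 30 cases the interval is wide, so small gaps between judges in a ranking table should not be over-read.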

Component responsibilities

Component	Owns	Doesn’t own
Task YAML	system prompt, scoring rubric, judge model defaults	Eval cases, judge selection
Eval set YAML	Cases with `id`, `sources`, `expected`, `stratum`	Task semantics
Runner	Parallel fan-out, retry, semaphore-bounded concurrency	Scoring, statistics
Adapter	Provider SDK calls, cost computation, error normalization	Prompt engineering, scoring
Scorer	Strict vs LLM-judge dispatch, pass-rule application	Generation, statistics
Stats	Bootstrap CI, Cohen’s κ, per-stratum aggregation	Adapter, scoring
Reporter	Table render + Postgres audit-log writes	All compute
Holdout guard	Refuse non-`--final-decision` holdout runs; append to tamper log	All else
Drift cron	Weekly verify + alert	Decision-making (operator confirms)

Failure modes & recovery

Failure	Detect	Recovery	Time
Provider 429 rate limit	Adapter	Exp backoff retry (3 attempts)	<30s
Provider 5xx	Adapter	Retry; if all fail → mark case as `error`, exclude from acc	<60s
Gemini empty `candidates`	Adapter returns `text=""`	Scorer flags as fail; report shows 0% acc loudly	immediate
Judge model deprecated	Adapter raises on first call	Update `judge_model` in task YAML	<5 min
Holdout leak attempt	CLI guard raises	Operator re-runs with explicit `--final-decision` if intended	immediate
Cohen’s κ NaN (both judges constant and identical, so chance agreement = 1 and κ = 0/0)	Stats	Report κ = 1.0 with `note=degenerate`	immediate
Postgres audit log down	Reporter	Render to stdout still works; warning printed	n/a
Bootstrap resample slow (N>1000)	Stats	Capped at 1000; configurable	n/a
Drift cron false positive	Telegram alert	Operator runs full holdout-99 to confirm before acting	<30 min
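The degenerate κ case in the table can be made concrete with a small pure-Python version of Cohen's κ for two boolean pass vectors. This is a stand-in written for illustration (the pipeline section names `cohen_kappa_score`); the `(kappa, note)` return shape is an assumption.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length boolean vectors.

    Returns (kappa, note). If both vectors are constant and identical,
    chance agreement p_e is 1 and the formula is 0/0; that case is
    reported as 1.0 with note="degenerate".
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n               # pass rate of each judge
    p_e = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    if p_e == 1.0:
        return 1.0, "degenerate"
    return (p_o - p_e) / (1 - p_e), None


j1 = [True, True, False, False]
j2 = [True, False, False, False]
print(cohen_kappa(j1, j2))                # (0.5, None)
print(cohen_kappa([True] * 3, [True] * 3))  # (1.0, 'degenerate')
```

Ordinary perfect agreement on mixed outcomes gives κ = 1 without special handling; only the all-pass or all-fail case needs the guard.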

Why these choices

Decision	Alternative considered	Why this won
LLM-judge default scorer	Strict substring as default	F3 lesson: strict rejected 19 valid findings (19 of its 20 failures) on Knowledge-Audit; 19.2pp accuracy gap
Grok 4.3 default judge (NOT Haiku)	Haiku (wrongly chosen twice because of SDK familiarity)	Bake-off: Haiku as direct auditor = 13%, Grok 4.3 = 99% on holdout. Production-verified.
YAML task config	JSON / Python	Multi-line system prompts + comments without escape pain
asyncio fan-out	Threading	Provider SDKs are async-native; cleaner cancellation
Postgres for audit log	SQLite / JSON files	Shared infra with Personal-RAG; cheap drift queries via SQL
Cohen’s κ pairwise	Fleiss’ κ across all	Pairwise reveals same-family bias (Sonnet ↔ Opus = 0.66); aggregate hides it
Bootstrap CI 1000 resamples	Wilson interval / asymptotic	Robust on small N (30 dev); same code path scales to 99 holdout
Stratified by 5 buckets × 3 langs	Random sample	Reveals language bias (Grok VN 88% / EN 63%); essential for VN-heavy corpora
Frozen holdout + CLI guard	Convention only	Operator self-discipline fails under deadline pressure; guard prevents accidents
30-LOC adapter pattern	Heavy abstract base class	Lazy abstraction (YAGNI > DRY) — proven across 4 providers without rewrite
Weekly drift cron	Per-deploy CI eval	Side projects don’t have CI; weekly cron is the SMB-grade safety net
Single-day F1+F2+F3 ship	Phased over a week	Tight feedback loop forces brutal scope cuts; no premature polish
