● Shipped P0 Size M Foundation

Eval-Framework — Architecture

System diagrams, eval pipeline, judge adapter pattern, scoring layer, failure modes.

Architecture

Sister docs: PRD (intent), Implementation (code deep-dive), Notes (decision log).

System view

flowchart TB
    classDef user fill:#cce0e8,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef core fill:#faedd6,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef adapter fill:#e0d5ed,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef store fill:#f4d6db,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px

    subgraph CLI["👤 Operator (PM)"]
        Cmd["eval-framework bake-off<br/>--task T --judges J1,J2,J3<br/>--eval-set dev-93 --scorer llm_judge"]
        Notebook["Python API<br/>(jupyter notebooks)"]
    end

    Cmd --> Runner
    Notebook --> Runner

    subgraph Core["🏗️ Eval-Framework core"]
        Runner["Runner<br/>(asyncio parallel fan-out + retry)"]
        Config["Task config<br/>(YAML + Pydantic schema)"]
        Scorer["Scorer<br/>llm_judge (default) /<br/>strict_substring"]
        Stats["Stats<br/>bootstrap CI 1000x<br/>Cohen's κ pairwise"]
        Reporter["Reporter<br/>(per-stratum table,<br/>cost, p95 latency)"]
        Runner --> Config
        Runner --> Scorer
        Scorer --> Stats
        Stats --> Reporter
    end

    subgraph Adapters["🔌 Judge adapter layer"]
        AnthAd["Anthropic adapter<br/>(Haiku/Sonnet/Opus 4.x)"]
        XaiAd["xAI adapter<br/>(Grok 4-fast, Grok 4.3)"]
        OaiAd["OpenAI adapter<br/>(GPT-5.4-mini)"]
        GeminiAd["Gemini adapter<br/>(Flash 2.5)"]
        MlxAd["Local MLX adapter<br/>(config-shaped, F7 wired)"]
    end

    Runner -->|parallel| AnthAd
    Runner -->|parallel| XaiAd
    Runner -->|parallel| OaiAd
    Runner -->|parallel| GeminiAd
    Runner -.->|deferred| MlxAd

    Scorer --> JudgeLLM["LLM-Judge<br/>(Sonnet 4.6 default)<br/>VALID / INVALID verdict"]

    subgraph Stores["🗄️ Eval stores"]
        Tasks["tasks/<br/>knowledge_audit.yaml<br/>mail_classify.yaml<br/>..."]
        EvalDev["eval/dev-93.yaml<br/>(iterable)"]
        EvalHold["eval/holdout-99.yaml<br/>FROZEN: one-pass only"]
        PG["Postgres 16<br/>eval_runs audit log<br/>(shared w/ Personal-RAG)"]
    end

    Config -.reads.-> Tasks
    Runner -.reads.-> EvalDev
    Runner -.reads.-> EvalHold
    Reporter -.writes.-> PG

    subgraph Monitor["📡 Drift monitor (launchd weekly)"]
        DriftCron["eval-framework verify<br/>--task T --eval-set holdout-30-sample"]
        Telegram["Telegram bot<br/>P0 alert if Δacc ≥ 5pp"]
        DriftCron --> Telegram
    end

    DriftCron -.uses.-> Adapters

    class Cmd,Notebook user
    class Runner,Config,Scorer,Stats,Reporter core
    class AnthAd,XaiAd,OaiAd,GeminiAd,MlxAd,JudgeLLM adapter
    class Tasks,EvalDev,EvalHold,PG store

The 5-component methodology

The framework is opinionated around 5 components discovered during the F3 bake-off. Each is necessary; dropping any one degrades decision quality:

| # | Component | Purpose | Failure mode if dropped |
|---|-----------|---------|-------------------------|
| 1 | Stratified eval set | 5 buckets × 3 languages = 15 strata; reveals per-segment bias | Aggregate accuracy hides catastrophic failure modes (e.g. Gemini 99/99 empty) |
| 2 | Single task × multi judges | Apples-to-apples comparison with same SYSTEM_PROMPT + user format | Confounded comparisons; can’t attribute delta to model vs prompt |
| 3 | LLM-as-judge scorer (for multi-valid-output) | Sonnet 4.6 verdicts VALID/INVALID per finding | Strict substring undercounts (79.8% strict vs 99.0% LLM-judge on the same outputs); kills good models |
| 4 | Pairwise Cohen’s κ | Detects correlated bias across same-family judges | Ensembles built on Sonnet+Opus = duplicate errors, false confidence |
| 5 | Frozen holdout | Held-out queries never used during tuning; final decision = 1 pass | Overfit to dev → production regression at first real query |
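To make component 1 concrete, here is a minimal sketch of per-stratum aggregation. The function name, the tuple layout, and the sample results are all made up for illustration; they are not taken from the framework's code or from any real run.

```python
from collections import defaultdict

def per_stratum_accuracy(results):
    """results: iterable of (bucket, lang, passed). Returns {(bucket, lang): accuracy}."""
    tally = defaultdict(lambda: [0, 0])  # stratum -> [passes, total]
    for bucket, lang, passed in results:
        tally[(bucket, lang)][0] += int(passed)
        tally[(bucket, lang)][1] += 1
    return {stratum: p / t for stratum, (p, t) in tally.items()}

# Made-up results: 18 of 20 cases pass overall, yet one stratum fails outright.
results = (
    [("positive_explicit", "EN", True)] * 9
    + [("positive_explicit", "VN", True)] * 9
    + [("negative_paraphrase", "VN", False)] * 2
)

overall = sum(passed for _, _, passed in results) / len(results)
by_stratum = per_stratum_accuracy(results)
```

In this toy data the aggregate looks healthy while the negative_paraphrase / VN stratum sits at zero, which is the kind of failure the stratified view is meant to surface.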

Pipeline

[0] Author defines task YAML:
    tasks/knowledge_audit.yaml
      name: knowledge_audit
      system_prompt: |
        You are an auditor. Extract claims from <source>...
        Compare across sources. Output JSON list of contradictions or [].
      user_template: "Sources:\n{sources}\n\nFindings (JSON):"
      scoring:
        mode: llm_judge
        judge_model: claude-sonnet-4-6
        rubric: |
          For each finding, output VALID iff it is a real contradiction
          (not paraphrase, not over-inference, quote is verbatim).
        pass_rule: positive_at_least_one_valid_or_empty_on_negative



[1] CLI parses:
    eval-framework bake-off --task knowledge_audit
                            --judges grok-4.3,claude-haiku-4.5,claude-opus-4.7,...
                            --eval-set dev-93
                            --scorer llm_judge



[2] Runner loads eval set (dev-93.yaml) → list[EvalCase]
    Each case = {id, sources, expected_findings, stratum:{bucket, lang}}



[3] Parallel fan-out (asyncio.gather, semaphore concurrency=8):
    for judge in judges:
      for case in cases:
        adapter.complete(system_prompt, user_template.format(sources=case.sources))
        → raw_output

    Retry on transient errors (429, 502, timeout): exp backoff 3 attempts.
    Capture: raw_output, latency_ms, input_tokens, output_tokens, cost.



[4] Scorer (mode=llm_judge):
    For each (judge, case):
      Parse raw_output → list[finding]
      For each finding:
        Sonnet 4.6 verdict(finding, case.sources) → VALID | INVALID
      pass = pass_rule(verdicts, case.expected_type)

    Caches Sonnet verdicts by (finding_hash, case_id) → re-runs free.



[5] Stats:
    For each judge:
      acc = sum(pass) / N
      ci_95 = bootstrap_resample(pass_vector, 1000)
      per_stratum_acc = group_by(stratum) → acc per (bucket, lang)
      cost_per_case = total_cost / N
      p95_latency = percentile(latencies, 95)

    Cohen's κ pairwise:
      For each (j1, j2):
        κ = cohen_kappa_score(pass_j1, pass_j2)



[6] Reporter renders to stdout + writes to Postgres:
    ┌────────────────┬──────┬──────────┬──────────┬─────────┬──────┐
    │ Judge          │ Acc  │ CI 95%   │ $/case   │ p95 ms  │ Rank │
    ├────────────────┼──────┼──────────┼──────────┼─────────┼──────┤
    │ Grok 4.3       │ 83.3%│ [76, 89] │ $0.0021  │ 1840    │  1   │
    │ Opus 4.7       │ 66.7%│ [59, 74] │ $0.0089  │ 2110    │  2   │
    │ Sonnet 4.6     │ 60.0%│ [52, 68] │ $0.0034  │ 1320    │  3   │
    │ GPT-5.4-mini   │ 53.3%│ [45, 61] │ $0.0011  │ 980     │  4   │
    │ Grok 4-fast    │ 50.0%│ [42, 58] │ $0.0008  │ 720     │  5   │
    │ Haiku 4.5      │ 13.3%│ [ 8, 20] │ $0.0004  │ 460     │  6   │
    │ Gemini 2.5 Fl. │ 0.0% │ [ 0,  4] │ $0       │ 510     │  7 ⚠ │
    └────────────────┴──────┴──────────┴──────────┴─────────┴──────┘

    Per-stratum + κ matrix printed below.
    eval_runs.id = 4271 (Postgres audit log).



[7] (Operator) Inspect top 2-3 → re-run on holdout-99 ONCE:
    eval-framework bake-off --task knowledge_audit
                            --judges grok-4.3
                            --eval-set holdout-99
                            --final-decision

    → 99.0% LLM-judge / 79.8% strict / $0.61/mo projected production cost
    → Commit decision in project notes; deploy.
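Step [3] can be sketched as below. This is a minimal illustration of semaphore-bounded fan-out with exponential-backoff retry, not the framework's runner: `TransientError`, `call_with_retry`, and `fan_out` are hypothetical names, and `complete` stands in for whatever coroutine wraps the adapter call.

```python
import asyncio
import random

class TransientError(Exception):
    """Hypothetical stand-in for a 429, 502, or timeout from a provider."""

async def call_with_retry(call, attempts=3, base_delay=0.5):
    """Await `call()` up to `attempts` times, backing off exponentially."""
    for attempt in range(attempts):
        try:
            return await call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: the caller can mark the case as an error
            # 0.5 s, 1 s, 2 s, ... plus a little jitter to avoid synchronized retries
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

async def fan_out(judges, cases, complete, concurrency=8):
    """Run `complete(judge, case)` for every pair, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def one(judge, case):
        async with sem:
            text = await call_with_retry(lambda: complete(judge, case))
            return judge, case, text

    tasks = [one(j, c) for j in judges for c in cases]
    # return_exceptions=True keeps one failed pair from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Results come back in the same order as the (judge, case) pairs were scheduled; a pair that exhausts its retries appears as an exception object in that position.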

Judge adapter pattern

All adapters implement a thin JudgeAdapter protocol:

from __future__ import annotations  # lets the CompletionResult annotation stay unquoted

from typing import Protocol

class JudgeAdapter(Protocol):
    model_id: str
    provider: str

    async def complete(
        self,
        system_prompt: str,
        user_message: str,
        max_tokens: int = 2048,
        temperature: float = 0.0,
    ) -> CompletionResult:
        """Returns (text, input_tokens, output_tokens, latency_ms, cost_usd)."""
        ...

Adding a new judge = ~30 LOC: import provider SDK, map kwargs, compute cost from token counts via a static price table. No core framework changes.
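As a rough illustration of how small an adapter can be, the sketch below implements the same `complete` shape against a stub instead of a provider SDK. Everything here is assumed for the example: the `CompletionResult` fields are inferred from the docstring above, `EchoAdapter` and `PRICE_TABLE` are hypothetical, and the prices are placeholders rather than real provider pricing.

```python
import asyncio
import time
from dataclasses import dataclass

@dataclass
class CompletionResult:
    text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float

# Placeholder prices in USD per million tokens; NOT real provider pricing.
PRICE_TABLE = {"example-model": {"in": 1.00, "out": 4.00}}

def cost_usd(model_id, input_tokens, output_tokens):
    """Cost from token counts; prices are per million tokens, hence the division."""
    price = PRICE_TABLE[model_id]
    return (input_tokens * price["in"] + output_tokens * price["out"]) / 1_000_000

class EchoAdapter:
    """Toy adapter with the JudgeAdapter shape and no network call."""
    model_id = "example-model"
    provider = "local-stub"

    async def complete(self, system_prompt, user_message, max_tokens=2048, temperature=0.0):
        start = time.perf_counter()
        text = user_message  # a real adapter would call the provider SDK here
        # Whitespace word counts stand in for the token usage an SDK would report.
        in_tok = len((system_prompt + " " + user_message).split())
        out_tok = len(text.split())
        latency_ms = (time.perf_counter() - start) * 1000
        return CompletionResult(text, in_tok, out_tok, latency_ms,
                                cost_usd(self.model_id, in_tok, out_tok))
```

A real adapter would replace the echo line with the SDK call and read token counts from the provider's usage metadata; the cost helper would stay the same.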

Provider quirks handled at adapter layer:

  • Anthropic: system is a top-level param, not a message
  • xAI: OpenAI-compatible Chat Completions; temperature=0 allowed
  • OpenAI GPT-5.4-mini: Responses API; reasoning.effort defaults handled
  • Gemini: system instruction goes in the system_instruction field. An empty candidates array is a real (and frequent) failure mode, so the adapter returns CompletionResult(text="", ...) and the scorer can flag it

Scoring layer

Two modes, selected per task config:

strict_substring

def score(raw_output: str, expected_substr: str) -> bool:
    return expected_substr.lower() in raw_output.lower()

Use when: there is a single canonical answer (e.g. “What’s the latency p95?” → "840 ms"). Fails on: multi-valid-output tasks. F3 lesson: on Knowledge-Audit, strict scored Grok 4.3 at 79.8%, but 19 of the 20 “failures” were valid alternate findings that the rubric had simply not listed.

llm_judge (default)

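import asyncio

async def score(raw_output: str, case: EvalCase, rubric: str, judge_model: str) -> bool:
    findings = parse_json_list(raw_output)
    # get_adapter is a hypothetical lookup from model id to a JudgeAdapter
    judge = get_adapter(judge_model)
    # One verdict call per finding, run concurrently; the rubric is the system prompt
    verdicts = await asyncio.gather(
        *(judge.complete(rubric, fmt(finding, case)) for finding in findings)
    )
    return apply_pass_rule(verdicts, case.expected_type)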

Use when: multi-valid-output tasks (audit, summarization, translation, classification with overlapping labels). Cost: ~30% on top of generation cost. Cache: judge verdicts cached by (finding_hash, case_id) — re-runs across iteration loops are free.
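The pass rule and the verdict cache can be sketched as follows. This is one reading of the rule name positive_at_least_one_valid_or_empty_on_negative, not the framework's confirmed logic: the `"positive"` / `"negative"` values for `expected_type`, the `VerdictCache` class, and the SHA-256 choice for `finding_hash` are all assumptions made for the example.

```python
import hashlib

VALID, INVALID = "VALID", "INVALID"

def apply_pass_rule(verdicts, expected_type):
    """Positive case: at least one VALID finding. Negative case: no findings at all."""
    if expected_type == "positive":
        return any(v == VALID for v in verdicts)
    return len(verdicts) == 0

class VerdictCache:
    """In-memory cache keyed by (finding_hash, case_id)."""

    def __init__(self):
        self._store = {}
        self.judge_calls = 0  # counts cache misses, i.e. paid judge calls

    @staticmethod
    def key(finding, case_id):
        finding_hash = hashlib.sha256(finding.encode("utf-8")).hexdigest()
        return (finding_hash, case_id)

    def get_or_judge(self, finding, case_id, judge_fn):
        k = self.key(finding, case_id)
        if k not in self._store:
            self.judge_calls += 1
            self._store[k] = judge_fn(finding)
        return self._store[k]
```

Because the key includes the case id, the same finding text appearing under two different cases is judged twice, which matters when the verdict depends on that case's sources.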

Default judge model: Grok 4.3 (per the dojo eval; I wrongly defaulted to Haiku twice out of SDK familiarity). Config-overridable per task.

Holdout protection

The CLI refuses to run on holdout-* eval sets unless the explicit --final-decision flag is set. This is the only mechanism that prevents “just one more iteration on holdout” overfit.

# runner.py
if eval_set_path.name.startswith("holdout-") and not args.final_decision:
    raise GuardError(
        f"Refusing to run on {eval_set_path.name} without --final-decision. "
        f"Use dev-* for iteration; holdout is one-pass only."
    )

After a --final-decision run, the holdout-set hash, judge list, and run id are appended to a tamper-evident log (eval/holdout-runs.log). Re-running on the same holdout is allowed but flagged loudly in the report (overfit warning).
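One common way to make an append-only log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below assumes that approach and a JSON-lines layout; the actual format of eval/holdout-runs.log is not described here, and the function and field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def append_entry(lines, holdout_hash, judges, run_id):
    """Append one JSON line whose `prev` field is the hash of the previous line."""
    prev = hashlib.sha256(lines[-1].encode()).hexdigest() if lines else GENESIS
    entry = {"prev": prev, "holdout_hash": holdout_hash,
             "judges": sorted(judges), "run_id": run_id}
    lines.append(json.dumps(entry, sort_keys=True))
    return lines

def verify_chain(lines):
    """True if every line's `prev` matches the hash of the line before it."""
    prev = GENESIS
    for line in lines:
        if json.loads(line)["prev"] != prev:
            return False
        prev = hashlib.sha256(line.encode()).hexdigest()
    return True

def prior_runs(lines, holdout_hash):
    """Count logged runs on the same holdout, to drive the overfit warning."""
    return sum(json.loads(line)["holdout_hash"] == holdout_hash for line in lines)
```

A chain like this detects edits to any entry that has a successor; protecting the newest entry would need an external anchor, such as committing the log to version control.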

Stratification

The 192-case bootstrap was generated by Haiku from a corpus sample, then hand-balanced to 5 buckets × 3 languages:

| Bucket | Languages | Count |
|--------|-----------|-------|
| positive_explicit (clear contradiction) | VN / EN / mixed | 13 each = 39 |
| positive_subtle (semantic contradiction) | VN / EN / mixed | 13 each = 39 |
| negative_paraphrase (looks-like-contradiction, actually same) | VN / EN / mixed | 13 each = 39 |
| negative_no_overlap (sources don’t share topic) | VN / EN / mixed | 13 each = 39 |
| edge_quote_mismatch (quote attribution wrong) | VN / EN / mixed | 12 each = 36 |
| Total | | 192 |

Split: 93 dev + 99 holdout (per-stratum proportional).
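A per-stratum proportional split cannot give every stratum an exact share (13 × 93/192 is not an integer), so some rounding scheme is needed. The sketch below uses largest-remainder rounding; that choice, the function name, and the dict-shaped cases are assumptions for illustration, not a description of how the real split was produced.

```python
import random
from collections import defaultdict

def stratified_split(cases, dev_total, seed=0):
    """Split cases into (dev, holdout) with per-stratum proportional dev quotas."""
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[case["stratum"]].append(case)

    frac = dev_total / len(cases)
    # Start from the truncated proportional quota for each stratum.
    quotas = {s: int(len(items) * frac) for s, items in by_stratum.items()}
    # Largest-remainder rounding: hand the leftover dev slots to the strata
    # whose exact quota lost the most to truncation.
    leftover = dev_total - sum(quotas.values())
    by_remainder = sorted(
        by_stratum, key=lambda s: len(by_stratum[s]) * frac - quotas[s], reverse=True
    )
    for s in by_remainder[:leftover]:
        quotas[s] += 1

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    dev, holdout = [], []
    for s, items in by_stratum.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        dev.extend(shuffled[:quotas[s]])
        holdout.extend(shuffled[quotas[s]:])
    return dev, holdout
```

Applied to the 15 strata above with dev_total=93, this yields 6 or 7 dev cases per stratum and leaves 99 cases for the holdout.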

Drift monitor (weekly cron)

[launchd weekly Sat 03:00]


eval-framework verify --task knowledge_audit
                      --eval-set holdout-30-sample
                      --judges grok-4.3


Postgres: write eval_runs row (run_type='drift_check')


SQL: compare acc to last 4-week median

    ├── Δacc ≤ -5pp → Telegram P0:
    │   "🔴 Knowledge-Audit Grok 4.3 regressed:
    │    98% → 91% (Δ -7pp). Run full holdout-99 to confirm.
    │    eval_runs.id=5023"

    └── Δacc > -5pp → silent

ADHD-friendly format per memory feedback_adhd_delivery: severity-tagged, action-verb first, no fluff.
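The comparison step can be sketched in plain Python. This assumes the alert fires when accuracy drops by 5 percentage points or more against the median of the last four weekly runs, which matches the example message (98% → 91%); the function name and argument layout are hypothetical, and the real check runs as SQL against eval_runs.

```python
from statistics import median

THRESHOLD_PP = 5.0  # alert when accuracy drops by at least this many points

def drift_alert(current_acc, previous_accs, task, model, run_id):
    """Return an alert string if accuracy regressed past the threshold, else None.

    Accuracies are percentages (e.g. 98.0), oldest first.
    """
    baseline = median(previous_accs[-4:])  # last four weekly runs
    delta = current_acc - baseline         # negative means regression
    if delta > -THRESHOLD_PP:
        return None                        # within tolerance: stay silent
    return (f"🔴 {task} {model} regressed: {baseline:.0f}% → {current_acc:.0f}% "
            f"(Δ {delta:+.0f}pp). Run full holdout-99 to confirm. "
            f"eval_runs.id={run_id}")
```

Using the median rather than the mean keeps one unusually bad or good week from shifting the baseline.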

Data flow — judge call

Runner (asyncio task per (judge, case))


Adapter.complete(system, user)

        ├── SDK retry-with-backoff (3 attempts)
        ├── Capture: text, in_tokens, out_tokens, ms
        └── Cost = (in_tokens × $/Mtok_in + out_tokens × $/Mtok_out) / 1e6


Scorer.score(text, case)

        ├── parse → list[finding]
        ├── for finding: judge.complete(rubric_prompt, finding)
        │     ← cached by (finding_hash, case_id)
        └── apply_pass_rule(verdicts, case.expected_type) → bool


Stats accumulator:
    pass_vector[judge].append(bool)
    cost_total[judge] += cost
    latencies[judge].append(ms)
    per_stratum[judge][stratum].append(bool)
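Once the accumulator is filled, the per-judge statistics are short to compute. The sketch below uses a percentile bootstrap for the 95% CI and a nearest-rank p95; both are common choices but are assumptions here, since the exact bootstrap variant and percentile method are not specified, and the function names are hypothetical.

```python
import math
import random

def bootstrap_ci(pass_vector, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a list of booleans."""
    rng = random.Random(seed)
    n = len(pass_vector)
    # Resample n cases with replacement, n_resamples times, and sort the accuracies.
    accs = sorted(
        sum(rng.choices(pass_vector, k=n)) / n for _ in range(n_resamples)
    )
    lo = accs[round((alpha / 2) * n_resamples)]
    hi = accs[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]
```

For example, `bootstrap_ci(pass_vector[judge])` and `percentile(latencies[judge], 95)` would fill the CI and p95 columns of the report for one judge.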

Component responsibilities

| Component | Owns | Doesn’t own |
|-----------|------|-------------|
| Task YAML | system prompt, scoring rubric, judge model defaults | Eval cases, judge selection |
| Eval set YAML | Cases with id, sources, expected, stratum | Task semantics |
| Runner | Parallel fan-out, retry, semaphore-bounded concurrency | Scoring, statistics |
| Adapter | Provider SDK calls, cost computation, error normalization | Prompt engineering, scoring |
| Scorer | Strict vs LLM-judge dispatch, pass-rule application | Generation, statistics |
| Stats | Bootstrap CI, Cohen’s κ, per-stratum aggregation | Adapter, scoring |
| Reporter | Table render + Postgres audit-log writes | All compute |
| Holdout guard | Refuse holdout runs without --final-decision; append to tamper log | All else |
| Drift cron | Weekly verify + alert | Decision-making (operator confirms) |

Failure modes & recovery

| Failure | Detect | Recovery | Time |
|---------|--------|----------|------|
| Provider 429 rate limit | Adapter | Exp backoff retry (3 attempts) | <30s |
| Provider 5xx | Adapter | Retry; if all fail → mark case as error, exclude from acc | <60s |
| Gemini empty candidates | Adapter returns text="" | Scorer flags as fail; report shows 0% acc loudly | immediate |
| Judge model deprecated | Adapter raises on first call | Update judge_model in task YAML | <5 min |
| Holdout leak attempt | CLI guard raises | Operator re-runs with explicit --final-decision if intended | immediate |
| Cohen’s κ NaN (perfect agreement with both pass vectors constant, so expected agreement is 1) | Stats | Set κ to 1.0 with note=degenerate | immediate |
| Postgres audit log down | Reporter | Render to stdout still works; warning printed | n/a |
| Bootstrap resample slow (N>1000) | Stats | Capped at 1000; configurable | n/a |
| Drift cron false positive | Telegram alert | Operator runs full holdout-99 to confirm before acting | <30 min |
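The κ recovery path is easiest to see in code. Below is a pure-Python sketch of Cohen's κ on two boolean pass vectors, with the degenerate case handled explicitly; it stands in for the `cohen_kappa_score` call in the pipeline, and the `(kappa, note)` return shape is an assumption for illustration.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length boolean vectors.

    Returns (kappa, note). When both vectors are constant and identical,
    expected agreement is 1 and the formula is 0/0, so kappa is reported
    as 1.0 with note="degenerate".
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n               # each judge's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    if p_e == 1.0:
        return 1.0, "degenerate"
    return (p_o - p_e) / (1 - p_e), None
```

Two judges that both pass every case hit the degenerate branch; two judges that agree perfectly on a mixed vector get κ = 1.0 through the normal formula, with no note.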

Why these choices

| Decision | Alternative considered | Why this won |
|----------|------------------------|--------------|
| LLM-judge default scorer | Strict substring as default | F3 lesson: strict masked 19/20 valid findings on Knowledge-Audit; 19.2pp accuracy gap |
| Grok 4.3 default judge (NOT Haiku) | Haiku (which I wrongly picked twice out of SDK familiarity) | Bake-off: Haiku as direct auditor = 13%, Grok 4.3 = 99% on holdout. Production-verified. |
| YAML task config | JSON / Python | Multi-line system prompts + comments without escape pain |
| asyncio fan-out | Threading | Provider SDKs are async-native; cleaner cancellation |
| Postgres for audit log | SQLite / JSON files | Shared infra with Personal-RAG; cheap drift queries via SQL |
| Cohen’s κ pairwise | Fleiss’ κ across all | Pairwise reveals same-family bias (Sonnet ↔ Opus = 0.66); aggregate hides it |
| Bootstrap CI 1000 resamples | Wilson interval / asymptotic | Robust on small N (30 dev); same code path scales to 99 holdout |
| Stratified by 5 buckets × 3 langs | Random sample | Reveals language bias (Grok VN 88% / EN 63%); essential for VN-heavy corpora |
| Frozen holdout + CLI guard | Convention only | Operator self-discipline fails under deadline pressure; guard prevents accidents |
| 30-LOC adapter pattern | Heavy abstract base class | Lazy abstraction (YAGNI > DRY); proven across 4 providers without rewrite |
| Weekly drift cron | Per-deploy CI eval | Side projects don’t have CI; weekly cron is the SMB-grade safety net |
| Single-day F1+F2+F3 ship | Phased over a week | Tight feedback loop forces brutal scope cuts; no premature polish |

See also

  • Sequence diagrams for bake-off + scoring in Implementation
  • 7 PM-bias catches in Notes — the decision-quality layer above the architecture
  • Enterprise adaptations of this same architecture in Enterprise