● Shipped P0 Size M Foundation

Eval-Framework — Enterprise

How the personal eval framework architecture adapts to enterprise — five concrete use cases with deltas, scale numbers, cost models, and compliance angles.

Enterprise patterns

The personal version is the strictest constraint case: one operator, one machine, $5/mo budget, audit-trail in a local Postgres. Relaxing those constraints unlocks B2B applications without rewriting the architecture — the methodology (stratified eval → bake-off → LLM-judge scorer → Cohen’s κ → frozen holdout) stays identical. This page documents five concrete adaptations.

What stays vs. what changes

The 5-component methodology — stratified eval set, single task × multi judges, LLM-judge scorer, pairwise Cohen’s κ, frozen holdout — is identical across all enterprise use cases below. The deltas are around scale, governance, integration with vendor LLM ops, drift SLA, and compliance evidence, not around eval mechanics.

Migration matrix: Personal → Enterprise

Aspect | Personal | Enterprise
Tasks evaluated | 1 task/project, 8 projects | N task types × M production features, organisation-wide
Eval set size | 93 dev + 99 holdout (192 total) | 1K–10K cases per task, stratified by tenant/segment/region
Authoring | Hand-curated YAML | Mix of hand-curated golden sets + automated mining from production logs + crowd-sourced labels
Judge providers | 4 cross-cloud (Anthropic, xAI, OpenAI, Google) | Same + private fine-tuned models + on-prem deployments + vertical-specialised vendors
Scorer | LLM-judge default (Grok 4.3) | Same, plus human-in-the-loop sampling for high-stakes domains; SME rubrics signed off by domain experts
Statistics | Bootstrap CI 1000 resamples + pairwise κ | Same + per-segment regression tests + significance vs. control group + multi-armed-bandit for A/B routing
Drift monitoring | Weekly cron + Telegram | Real-time streaming eval on sampled prod traffic + PagerDuty + SLO dashboards
Compliance evidence | None | Per-run signed audit trail, eval-run hash on-chain or in WORM storage, exportable for auditor review (PCI/SOC2/HIPAA)
Cost model | <$5/mo total | Per-eval-run usage-based pricing or seat-based platform fee; bake-off cost is a line item in vendor migration ROI
Holdout governance | CLI guard + tamper log | Air-gapped holdout storage, dual-control release, holdout-rotation policy after each model migration
Integration | CLI + Postgres | CI/CD gate (block deploy on regression ≥X pp) + vendor LLM ops platforms (LangSmith/Weights & Biases/Braintrust) + ticketing
Latency SLA on bake-off | Best-effort (~12 min) | Streaming results within SLA; partial results acceptable if provider X is degraded
Org scale | 1 person | Centralised AI-ops team + per-product PM-owned eval sets; governance forum reviews drift breaches

The architecture diagram doesn’t change — only the labels on each component scale up.


Use case A — B2B SaaS chatbot eval (vendor like CX Genie / Botpress Cloud)

Problem

Conversational AI vendors selling to mid-market and enterprise customers ship monthly model + prompt updates. Today most rely on a static QA team poking the bot ad-hoc, then “shipping and praying”. Customer-reported regressions surface 2-6 weeks post-ship; by then 3-5 cohorts of users have experienced degraded quality, churn risk has spiked, and a costly emergency rollback is needed. The vendor’s own engineering org also can’t prove to enterprise procurement that “our v4.2 is better than v4.1” — the buyer asks for evidence and gets a marketing deck.

Industry datum: 38% of enterprise chatbot procurement RFPs in 2024 now require eval-driven dev evidence (Gartner Chatbot Magic Quadrant 2025).

Persona

Vendor PMs + AI engineers (the seller). Enterprise procurement + risk teams (the buyer). Customer success owners renewing accounts where bot quality is at risk.

Why eval matters

  • Ship-then-pray vs ship-then-measure: without a 7-metric scorecard run pre-deploy, no defensible “we tested it” claim
  • Customer-specific golden sets: each enterprise customer trains the bot on their own corpus → each gets a unique eval set; vendor must evaluate per-customer regression before pushing a shared model update
  • Procurement evidence: bake-off output + per-stratum breakdown is the artifact procurement accepts as proof

What changes from personal version

  • 7-metric scorecard per case: response accuracy, hallucination rate, escalation appropriateness, deflection rate, latency p95, cost/turn, context-quality score (RAG retrieval relevance). Personal version measures 1-2 metrics per task; enterprise scorecard demands all 7 with per-stratum breakdown.
  • Per-customer golden sets: 200-500 cases per enterprise tenant, hand-curated from real conversations + edge cases. ~50-200 customers = ~10K-100K eval cases total.
  • Eval-driven dev gate: every prompt iteration runs against the customer’s golden set + a shared cross-customer regression suite. Block deploy if any customer regresses ≥3pp.
  • A/B production routing: pilot model variant on 5% of traffic with online eval; promote when 95% CI lower bound beats control.
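The promotion rule in the A/B routing bullet can be sketched as a percentile bootstrap on the difference in mean score between the pilot variant and control. This is a minimal sketch under stated assumptions, not the framework's implementation: the function names, the fixed seed, and the 0/1 per-case score encoding are hypothetical, and the 1,000-resample default simply mirrors the personal version's bootstrap setting.

```python
import random
from statistics import mean

def bootstrap_diff_ci(control, variant, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(control).

    control / variant are lists of per-case scores (assumed: 1.0 = pass, 0.0 = fail).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(mean(v) - mean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def should_promote(control, variant, **kwargs):
    """Promote only when the CI lower bound on the improvement is above zero."""
    lo, _ = bootstrap_diff_ci(control, variant, **kwargs)
    return lo > 0.0
```

Gating on the lower bound rather than the point estimate means a variant that looks better on a small 5% pilot sample is not promoted until the interval clears zero.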

Stack mapping (Eval-Framework primitives → enterprise extension)

Eval-Framework primitive | Enterprise mapping
Task YAML | One task type per metric (response_accuracy, hallucination, escalation, …); 7 YAMLs
Stratified eval set | Per-customer × per-intent stratification (5-10 intents × N customers × language)
LLM-judge scorer | SME-reviewed rubric per metric; judge model varies (Sonnet 4.6 for accuracy, Grok 4.3 for hallucination per dojo eval)
Cohen’s κ | Detect correlated bias when ensembling Anthropic + OpenAI judges
Frozen holdout | Per-customer holdout rotates quarterly; air-gapped storage
Drift cron | Real-time streaming eval on 1% production traffic; alert on ≥3pp regression
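Sampling 1% of production traffic for the streaming eval can be done deterministically, so the same conversation is always in or out of the sample across retries and replays. The sketch below is one way to do that, assuming a string conversation ID is available; `in_eval_sample`, the `salt` value, and the ID format are hypothetical names, not part of Eval-Framework.

```python
import hashlib

def in_eval_sample(conversation_id: str, rate: float = 0.01, salt: str = "drift-v1") -> bool:
    """Deterministically select roughly `rate` of conversations for streaming eval.

    Hashing (salt, id) gives a stable pseudo-random bucket per conversation;
    changing the salt draws a fresh sample.
    """
    digest = hashlib.sha256(f"{salt}:{conversation_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64
```

Hash-based selection needs no shared state between workers, which is why it tends to fit streaming pipelines better than a random draw per request.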

Cost estimate (mid-market chatbot vendor, 100 enterprise customers)

  • Eval cases: 100 customers × 300 cases avg = 30K cases
  • Bake-off frequency: weekly per customer (4 judges × 300 cases = 1,200 calls/customer/week)
  • LLM-judge scoring: ~30% on top
  • All-in eval compute: ~$8K/mo (~$96K/year) vs $300K-1M/year in prevented churn from regression incidents, roughly a 3-10× return on compute spend
  • Eval team: 2 AI ops engineers + 1 product analyst = ~$600K/year fully loaded

Compliance angle

  • SOC2 Type II: eval-run audit log proves “change management with rollback readiness” controls
  • EU AI Act (high-risk chatbots in financial/health): documented eval methodology + holdout governance becomes mandatory evidence
  • Customer contract clauses: “we will not degrade by more than X pp without 30-day notice” — enforceable only with eval infrastructure

Use case B — Fintech AI reconciliation eval (payment platform like LivePayments)

Problem

Payment platforms process T+1 settlement reconciliation: matching incoming bank statements against expected payouts across multi-currency, multi-rail (SWIFT / ACH / SEPA / local rails), with FX, fees, and chargeback adjustments. An LLM-assisted matcher classifies ambiguous cases (suspected duplicates, near-matches, dispute candidates). When the matcher is wrong, money sits in suspense accounts, regulators flag breaks, and ops teams burn hours on manual reconciliation. Vendors face audit pressure to prove any model swap (e.g. moving from Haiku 4.5 → Sonnet 4.6 for accuracy lift) didn’t quietly regress on edge cases.

Industry datum: PCI-DSS v4.0 explicitly requires “documented testing of AI/ML components used in payment processing” before production rollout.

Persona

Payment ops engineers, treasury ops managers, compliance/risk officers, external auditors. Vendor PM responsible for the matcher feature.

Why eval matters

  • Auditability requirement: every model swap must produce signed eval evidence — bake-off result + per-segment regression + holdout-run hash
  • Asymmetric error cost: a false-positive match (auto-clearing when actually a duplicate) costs $X in chargeback exposure; a false-negative (flagging a real match as ambiguous) costs $0.50 in ops time. Scorer must weight asymmetrically.
  • FX edge cases: mid-day FX rate shifts create near-match candidates that look like duplicates; eval set must over-sample these

What changes from personal version

  • Segment-weighted accuracy: weight per case by transaction value tier + currency + rail. Personal version treats all cases equally; here a $1M T+1 settlement weighs 10K× a $100 retail txn.
  • Cost-asymmetric scoring rubric: LLM-judge prompt encodes “false-positive 100× cost of false-negative” so reported accuracy reflects business risk.
  • Live shadow eval: run candidate model in parallel with production matcher on real flow (no auto-action); compare verdicts; promote only when shadow agrees with production on 99.5%+ of cases AND beats production on disputed-case accuracy.
  • Holdout rotation: holdout rotated quarterly with dual-control sign-off (engineering + compliance both must approve release).
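The first two bullets can be illustrated numerically. The text encodes the cost asymmetry inside the LLM-judge prompt; the sketch below instead computes it as plain metrics over labelled cases, which is a swapped-in simplification. The `Case` fields, function names, and the unit costs (100 vs 1, taken from the "100×" ratio above) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Case:
    value_usd: float       # transaction value, used as the case weight
    expected_match: bool   # ground truth: is this a real match?
    predicted_match: bool  # matcher verdict

def value_weighted_accuracy(cases):
    """Accuracy where each case counts in proportion to its transaction value."""
    total = sum(c.value_usd for c in cases)
    correct = sum(c.value_usd for c in cases if c.expected_match == c.predicted_match)
    return correct / total

def asymmetric_error_cost(cases, fp_cost=100.0, fn_cost=1.0):
    """Total error cost with a false positive priced at 100x a false negative."""
    cost = 0.0
    for c in cases:
        if c.predicted_match and not c.expected_match:
            cost += fp_cost  # auto-cleared something that should not have cleared
        elif c.expected_match and not c.predicted_match:
            cost += fn_cost  # sent a real match to manual review
    return cost

cases = [
    Case(1_000_000, expected_match=True, predicted_match=False),  # missed $1M settlement
    Case(100, expected_match=True, predicted_match=True),
    Case(100, expected_match=False, predicted_match=True),        # false positive
]
print(f"value-weighted accuracy: {value_weighted_accuracy(cases):.4f}")
print(f"asymmetric error cost:   {asymmetric_error_cost(cases):.0f}")
```

With value weighting, one wrong verdict on the $1M settlement dominates the score even though a third of the cases were handled correctly, which is the behaviour the segment-weighted bullet asks for.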

Stack mapping

Eval-Framework primitive | Enterprise mapping
Task YAML | reconciliation_match, dispute_classify, fx_edge_case
Stratified eval set | (currency × rail × value-tier × FX-volatility), ~50 strata
LLM-judge scorer | Cost-asymmetric rubric; SME (treasury ops lead) signs off
Frozen holdout | Air-gapped; quarterly rotation; signed release ceremony
Drift cron | Daily on a 1K-case sample; PagerDuty on ≥1pp regression on high-value-tier stratum

Cost estimate (regional payment platform, 5M txns/day)

  • Eval cases: 10K hand-curated + 50K mined from production
  • Daily drift: 1K-case sample × 3 judges = 3K calls/day × $0.003 = $9/day = $270/mo
  • Bake-off (4 judges × 10K cases = 40K calls): ~$120/run × 4 runs/mo = $480/mo
  • All-in: ~$750/mo vs $50M+ daily settlement value at risk = trivially worth it
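The arithmetic behind these figures is simple enough to keep as a small calculator, so the estimate can be re-run when case counts or prices change. This is a sketch: the $0.003 per-call price comes from the daily-drift bullet, while the function name, the 30-day month, and the four bake-off runs per month (the count that reproduces the $480/mo figure) are assumptions.

```python
def judge_cost_usd(cases, judges, runs, usd_per_call=0.003):
    """Cost of `runs` eval runs, each scoring `cases` with `judges` judge calls per case."""
    return cases * judges * runs * usd_per_call

drift_monthly = judge_cost_usd(cases=1_000, judges=3, runs=30)    # daily drift, 30-day month
bakeoff_per_run = judge_cost_usd(cases=10_000, judges=4, runs=1)  # one full bake-off
bakeoff_monthly = bakeoff_per_run * 4                             # assumed 4 runs per month

print(f"drift:   ${drift_monthly:,.0f}/mo")
print(f"bakeoff: ${bakeoff_per_run:,.0f}/run, ${bakeoff_monthly:,.0f}/mo")
print(f"all-in:  ${drift_monthly + bakeoff_monthly:,.0f}/mo")
```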

Compliance angle

  • PCI-DSS v4.0: documented AI/ML component testing evidence; per-run audit trail accepted by QSA
  • SOX: model change control with rollback readiness; eval-run hash stored in WORM compliance vault
  • Local payment regulators (e.g. MAS, SBV, BNM): pre-rollout impact assessment proven via stratified holdout result

Use case C — EdTech content-moderation eval (student-data product)

Problem

EdTech platforms serving K-12 (kindergarten through high school) handle highly regulated minor-data and must moderate every piece of user-generated content (forum posts, chat, assignment submissions, photo uploads). The moderation model classifies content as safe / age-appropriate-warning / blocked. False negatives (harmful content reaches a minor) trigger regulatory fines + reputational catastrophe; false positives (over-blocking benign student work) frustrate teachers + parents + erode adoption. When the vendor swaps moderation models, every regulator and every parent rep wants evidence the new model isn’t more dangerous.

Industry datum: COPPA (US), GDPR-K (EU), Singapore PDPA-minor amendments all require demonstrable testing of AI moderation tools handling minor data; “we tested it” is no longer acceptable — evidence is.

Persona

EdTech product owners, school-board procurement, parent representatives on advisory boards, regulators (FTC / DfE / MOE / KOMINFO equivalents), trust & safety engineers.

Why eval matters

  • Compliance evidence for parents and regulators: the eval report itself becomes a public-facing artifact (“our moderation model achieves 99.2% recall on harmful content across 8 categories, audited quarterly”)
  • Age-appropriate stratification: 4-year-old vs 14-year-old language norms differ wildly; eval set MUST stratify by age cohort
  • PII filter sub-eval: a separate task evaluates the PII redactor that runs before content reaches the moderation model (defense-in-depth)

What changes from personal version

  • Multi-category recall metric: 8-10 harm categories (self-harm, bullying, sexual, violence, drugs, hate, doxxing, scam) each with separate recall target (>99% for self-harm, >95% for others). Personal version measures single-task accuracy; here each category is its own eval task.
  • Age-cohort stratification: 4 cohorts (4-6, 7-10, 11-13, 14-18) × harm category × multiple languages × content type (text/image/audio) = 100+ strata. Reveals “moderator works great on teens but misses preschool euphemisms”.
  • Per-school golden sets: top school-district customers contribute curated cases representing their student population; vendor maintains a federated holdout per district.
  • Human-in-the-loop sampling: 1% of flagged content reviewed by trust & safety humans → labels feed back into next eval set (active learning).
  • Public eval report: a quarterly published methodology + headline numbers (similar to Apple’s annual transparency report).
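The multi-category recall metric in the first bullet amounts to one recall figure per harm category, each checked against its own target. A minimal sketch follows, assuming labelled records of the form (category, is_harmful, was_flagged); the record shape and function names are hypothetical, and the thresholds are the ones stated above (>99% for self-harm, >95% for the rest).

```python
from collections import defaultdict

RECALL_TARGETS = {"self-harm": 0.99}  # stricter target named in the text
DEFAULT_TARGET = 0.95                 # ">95% for others"

def recall_by_category(records):
    """records: iterable of (category, is_harmful, was_flagged) tuples."""
    harmful = defaultdict(int)
    caught = defaultdict(int)
    for category, is_harmful, was_flagged in records:
        if is_harmful:  # recall only looks at truly harmful items
            harmful[category] += 1
            caught[category] += int(was_flagged)
    return {cat: caught[cat] / harmful[cat] for cat in harmful}

def failing_categories(records):
    """Categories whose recall does not strictly exceed their target."""
    recalls = recall_by_category(records)
    return sorted(
        cat for cat, r in recalls.items()
        if not r > RECALL_TARGETS.get(cat, DEFAULT_TARGET)
    )
```

Reporting the failing categories instead of one blended number keeps a weak self-harm recall from being hidden by strong results elsewhere.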

Stack mapping

Eval-Framework primitive | Enterprise mapping
Task YAML | One per harm category (8-10 YAMLs) + PII filter + age-appropriate language
Stratified eval set | (category × age-cohort × language × content-type), 100+ strata
LLM-judge scorer | Trust & safety SME rubric per category; conservative side (false-positive better than false-negative on self-harm)
Frozen holdout | Per-district holdout with parental-consent governance; quarterly rotation
Drift cron | Real-time eval on 0.1% of production moderation decisions; SLO dashboard per harm category

Cost estimate (regional K-12 EdTech, 2M MAU)

  • Eval cases: 20K hand-curated by T&S team + 100K mined-and-reviewed
  • Real-time eval: 0.1% × 10M moderation decisions/day = 10K LLM-judge calls/day = ~$30/day = $900/mo
  • Quarterly bake-off across vendor models: ~$500/quarter
  • All-in: ~$1.2K/mo vs (regulatory fine exposure + churn risk + brand damage = unbounded)

Compliance angle

  • COPPA / GDPR-K: documented testing evidence for AI tools processing minor data
  • EU AI Act (high-risk: education): moderation model classified high-risk → mandatory pre-rollout impact assessment + ongoing monitoring documented
  • Parental transparency: public eval report becomes a trust signal in renewals + procurement

Use case D — Healthcare symptom triage eval (clinical decision support)

Problem

Clinical decision support tools that triage incoming patient messages (telemedicine intake, nurse-line, ER pre-triage) classify symptoms into N priority levels (e.g. immediate-ER, urgent-care-within-4h, GP-within-24h, self-care). Models help nurses scale to higher patient volume, but false-negatives are catastrophic (sending a heart-attack patient home as “self-care”). Vendors must prove triage accuracy across the full risk distribution before clinical deployment, and prove it again after every model update.

Industry datum: FDA’s “Predetermined Change Control Plan” (PCCP) framework (2024) requires vendors of AI/ML medical devices to submit a documented eval-and-monitoring plan for any model updates intended to ship post-clearance.

Persona

Clinical product managers, medical directors, FDA / EMA / CDSCO / equivalent regulators, hospital chief medical informatics officers (CMIOs), nurse line operations.

Why eval matters

  • False-negative cost asymmetry: missing a true emergency = patient harm + malpractice exposure; over-triaging to ER = mild inconvenience + cost. Scorer must encode this asymmetry explicitly.
  • LLM-judge with clinician rubric: strict-match scoring fails because there are often 2-3 acceptable triage levels for ambiguous presentations; LLM-judge with clinician-authored rubric captures clinically-equivalent answers.
  • Pre-clearance + post-clearance evidence: every model update requires PCCP-compliant testing documentation.

What changes from personal version

  • Clinician-authored LLM-judge rubric: senior physicians draft the verdict prompt: “A triage of urgent-care-within-4h is VALID for this presentation if the symptom complex falls within X clinical guidelines.” Personal version uses a generic rubric; here every rubric is SME-signed.
  • False-negative-weighted accuracy: report a safety score = 1 - false_negative_rate_on_emergencies, alongside aggregate accuracy. Personal version reports aggregate; here safety score is the headline.
  • Specialty stratification: pediatric / geriatric / obstetric / cardiac / respiratory / mental-health — different presentations, different triage thresholds, different judge rubrics.
  • Counterfactual eval: for each holdout case, run the model with key clinical details perturbed (age ±10y, vitals ±20%) — robustness check.
  • Dual judges: every case scored by 2 independent SME-rubric judges; disagreement → escalate to clinician review (active learning).
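The safety score and the dual-judge escalation rule above are both small enough to state directly. The sketch below assumes each eval case is a (true_level, predicted_level) pair and that "immediate-ER" is the emergency level, as in the example priority levels earlier; the function names and data shape are hypothetical.

```python
EMERGENCY = "immediate-ER"  # top priority level from the example triage scale

def safety_score(cases):
    """1 - false_negative_rate_on_emergencies.

    cases: list of (true_level, predicted_level) pairs.
    """
    emergencies = [(t, p) for t, p in cases if t == EMERGENCY]
    if not emergencies:
        raise ValueError("eval set contains no emergency cases")
    missed = sum(1 for _, p in emergencies if p != EMERGENCY)
    return 1 - missed / len(emergencies)

def aggregate_accuracy(cases):
    """Plain exact-match accuracy over all cases, for comparison."""
    return sum(1 for t, p in cases if t == p) / len(cases)

def needs_clinician_review(verdict_a, verdict_b):
    """Dual-judge rule: any disagreement between the two judges escalates."""
    return verdict_a != verdict_b
```

The two numbers can diverge sharply: a model that over-triages many self-care cases to the ER has poor aggregate accuracy but can still have a high safety score, which is why the text makes the safety score the headline.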

Stack mapping

Eval-Framework primitive | Enterprise mapping
Task YAML | One per specialty (6-8 YAMLs); rubric signed by specialty SME
Stratified eval set | (specialty × age × severity × demographics × language), 200+ strata
LLM-judge scorer | Clinician-authored rubric per specialty; dual-judge with escalation
Cohen’s κ | Inter-judge κ tracked over time; <0.7 = rubric clarification needed
Frozen holdout | Curated by medical advisory board; rotation tied to clinical guideline updates (annual)
Drift cron | Daily on 500-case stratified sample; clinical incident triggers immediate full-holdout re-run
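For reference, the inter-judge κ in the table is standard Cohen's kappa over the two judges' verdicts on the same cases: observed agreement corrected for the agreement expected by chance. A minimal pure-Python sketch follows; the function names are hypothetical, and the 0.7 threshold is the one from the table.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    p_expected = sum(
        count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

def rubric_needs_clarification(labels_a, labels_b, threshold=0.7):
    return cohens_kappa(labels_a, labels_b) < threshold
```

Raw agreement overstates reliability when one verdict dominates, so tracking κ rather than percent agreement is what makes the rubric-clarification trigger meaningful.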

Cost estimate (regional telemedicine platform, 500K patient encounters/month)

  • Eval cases: 5K-10K per specialty hand-curated by medical advisory board = ~50K total
  • Bake-off pre-update: 3 candidate models × 50K = 150K calls + dual-judge = ~$1.5K/update × 4 updates/year = $6K/year
  • Daily drift: 500 cases × 2 judges = 1K calls/day = ~$3/day = ~$90/mo
  • Medical advisory board honorarium: ~$50K/year
  • All-in: ~$57K/year vs a single missed-emergency malpractice settlement of $1M-10M; trivially worth it

Compliance angle

  • FDA PCCP: documented eval + monitoring plan for SaMD (Software as a Medical Device) post-clearance changes
  • EU MDR + AI Act (high-risk: medical): pre-rollout impact assessment + post-market surveillance
  • HIPAA: eval cases are de-identified per Safe Harbor; eval-run audit log is itself ePHI-handling-compliant

Use case E — Multi-tenant LLM migration eval (SaaS with N clients, e.g. PCF / NewLife / Ilham / BBL pattern)

Problem

A multi-tenant SaaS platform serves N enterprise clients on a shared infrastructure but with per-client configuration (different system prompts, different RAG corpora, different domain vocabulary, different regulatory regimes). When the platform upgrades the underlying LLM (e.g. Haiku 4.5 → Sonnet 4.6 for cost-quality lift), each client’s experience changes independently — some improve, some regress, depending on how their config interacts with the new model. Without per-client eval, a blanket migration silently regresses 1-2 clients → support escalation → renewal risk → emergency rollback. The PM needs evidence: “we evaluated the migration per-client and only N/M clients meet our regression bar; we will not migrate the remaining clients until we re-tune their prompts.”

Internal datum: this matches the LL multi-tenant pattern recorded in the memory note ll_multitenant_requirements. PCF / NewLife / Ilham / BBL require the same feature shape with per-client config, always expressed as config/strategy/flag and never as an if-else override.

Persona

Platform PM (the migrator). Per-client AM / CSM (the relationship owner). Each client’s internal stakeholder (the consumer). Engineering owns the rollout mechanics.

Why eval matters

  • Per-client regression rate: prove for each tenant that the new model is non-regressive on their golden set before flipping their config
  • Per-client prompt re-tune evidence: when migration shows regression, eval the prompt rewrite candidates and pick the one that recovers parity
  • Phased rollout governance: weekly cohort of “next clients to migrate” picked based on eval evidence, not gut

What changes from personal version

  • Per-tenant eval set: 200-500 cases per client, mined from their conversation history + edge cases their CSM has filed. ~10-50 clients = ~2K-25K total cases.
  • Per-tenant pass criterion: each client has a configured “must not regress more than X pp on metric Y”; CLI gate refuses to flip their config flag unless eval evidence shows compliance.
  • Shared cross-tenant regression suite: a separate “common” eval set that all clients share, to catch model-wide regressions that no single-client eval would surface.
  • Migration cohort UI: each week the PM sees a dashboard — “5 clients passed migration eval, 2 failed, 3 pending” — and decides next cohort.
  • Per-tenant prompt re-tune workflow: for failing clients, eval candidate prompt rewrites (v2.1.PCF, v2.1.NewLife, …) before re-running migration eval.
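The per-tenant pass criterion and the weekly cohort decision reduce to one comparison per client. The sketch below is illustrative only and is not the eval-framework CLI: `TenantResult`, `migration_cohort`, the metric values, and the tolerances are hypothetical, with tenant names borrowed from the examples above.

```python
from dataclasses import dataclass

@dataclass
class TenantResult:
    tenant: str
    baseline: float           # metric on the current model, in [0, 1]
    candidate: float          # same metric on the new model
    max_regression_pp: float  # tenant-configured tolerance, in percentage points

def migration_cohort(results):
    """Split tenants into those safe to migrate and those needing a prompt re-tune."""
    passed, failed = [], []
    for r in results:
        regression_pp = (r.baseline - r.candidate) * 100  # positive = got worse
        if regression_pp <= r.max_regression_pp:
            passed.append(r.tenant)
        else:
            failed.append(r.tenant)
    return passed, failed

results = [
    TenantResult("PCF", baseline=0.90, candidate=0.92, max_regression_pp=1.0),
    TenantResult("NewLife", baseline=0.88, candidate=0.83, max_regression_pp=2.0),
    TenantResult("Ilham", baseline=0.80, candidate=0.795, max_regression_pp=1.0),
]
passed, failed = migration_cohort(results)
print("migrate:", passed)
print("re-tune:", failed)
```

Keeping the tolerance on the tenant record rather than in the gate logic follows the config-over-if-else rule: adding a client means adding data, not a new branch.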

Stack mapping

Eval-Framework primitive | Enterprise mapping
Task YAML | Per-feature × per-client overlay (base task + client config overlay)
Stratified eval set | (client × intent × language × complexity) per tenant
LLM-judge scorer | Per-client SME rubric; CSM signs off
Cohen’s κ | Cross-client judge agreement: detect when a judge is biased toward one client’s style
Frozen holdout | Per-client holdout; client’s compliance owner signs the release
Drift cron | Per-client weekly; alert routed to CSM + platform PM
Migration cohort gate | Custom CLI subcommand eval-framework migrate --cohort <week> blocks deploys without passing eval per tenant

Cost estimate (mid-market SaaS, 30 enterprise tenants)

  • Eval cases: 30 tenants × 300 cases avg = 9K + 1K shared = 10K
  • Migration bake-off: 3 candidate models × 10K cases + dual-judge = ~$300/migration × 4 migrations/year = $1.2K/year
  • Weekly per-client drift: 30 clients × 100 cases × 1 judge = 3K calls/week = ~$10/week = $40/mo
  • Per-tenant prompt re-tune eval: ~$50/client/migration when regression hits ~30% of tenants = ~$1.5K/year
  • All-in: ~$3K/year for migration infrastructure vs (a single emergency rollback + 1 lost renewal = $200K-1M = 70-300× ROI)

Compliance angle

  • Per-client SOC2 reports: eval audit trail proves change-management controls per tenant
  • Contract SLAs: many enterprise contracts include “no regression of more than X pp” clauses; eval is the enforcement mechanism
  • Data residency: per-client holdouts may need to live in their region (EU client → EU storage); eval runs region-pinned

Cross-cutting patterns

These appear in 3+ use cases above and form a second-tier reusable layer beyond the personal Eval-Framework foundation:

  1. Per-segment / per-tenant eval orchestration: stratify + parallelise + report per-segment pass rates; gate deploys on the worst-segment regression
  2. Cost-asymmetric scoring: LLM-judge rubrics that encode business-risk asymmetry (false-positive vs false-negative cost) → reported accuracy reflects business impact, not raw correctness
  3. Dual-judge with escalation: 2 independent LLM judges per case; disagreement triggers human SME review and feeds active learning
  4. Air-gapped holdout governance: dual-control release ceremony; signed by engineering + compliance; rotation tied to regulatory cadence
  5. Compliance audit-trail export: per-run signed hash, exportable as auditor-consumable artifact (PDF + JSON), retained in WORM storage
  6. CI/CD migration gate: eval-framework migrate --cohort blocks deploys until per-tenant eval evidence passes; replaces “PM approves via Jira” with deterministic gate
  7. Real-time streaming eval: production traffic sampled → LLM-judge in stream → SLO dashboard + alerting; catches drift in hours, not weeks
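Pattern 5 (per-run signed hash) can be sketched with the standard library: hash a canonical JSON form of the run record, then sign the hash. This is a sketch under assumptions, not the framework's audit format. The record fields and function names are hypothetical, and HMAC with a shared key stands in for whatever signing scheme a real deployment uses (an auditor-facing export would more likely use asymmetric signatures).

```python
import hashlib
import hmac
import json

def run_hash(run_record: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(run_record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def sign_run(run_record: dict, key: bytes) -> str:
    """HMAC-SHA256 over the run hash (stand-in for a real signing scheme)."""
    return hmac.new(key, run_hash(run_record).encode("ascii"), hashlib.sha256).hexdigest()

def verify_run(run_record: dict, key: bytes, signature: str) -> bool:
    """Constant-time check that the record still matches its signature."""
    return hmac.compare_digest(sign_run(run_record, key), signature)
```

Storing the hash and signature in WORM storage alongside the exported JSON lets an auditor confirm later that the eval results were not edited after the run.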

Building these once = ~8 weeks engineering. Then each new vertical = 2-4 weeks to launch instead of 8-12 weeks.

Go-to-market thinking

The architecture supports 3 plausible business models, each with different pricing / positioning:

Model | Target | Pricing | Sales motion
B2B SaaS (AI eval platform) | AI vendor teams (chatbot, fintech, edtech, healthtech) | Per-eval-run usage + platform seat fee | PLG signup → trial → upgrade. AE for regulated industries.
Compliance-evidence add-on | Regulated AI deployments (fintech/healthtech/edtech) | Per-eval-evidence-export + retained-audit-log fee | Enterprise direct sales, 6-month cycles
Open-source + managed | Devs / smaller vendors | Free OSS + managed cloud $X/mo | Inbound from GitHub stars; convert to managed for ops cost relief

The B2B SaaS — AI eval platform model has the cleanest scaling story: eval volume grows with the vendor’s product success, so revenue is naturally aligned with customer outcomes. The compliance-evidence add-on is highest revenue per deal but requires domain expertise and SME networks per vertical. OSS is brand-building but slowest revenue.

What’s NOT in the personal version that enterprise needs

Realistic gap list. These items are zero-effort in the personal version but need real engineering investment for enterprise:

Gap | Effort | Priority
Multi-tenant isolation (per-customer eval sets, per-customer compute quota) | 3-4 weeks | P0
SSO / SAML for eval platform | 2 weeks | P0
Air-gapped holdout storage + dual-control release | 2 weeks | P0 (regulated industries)
Audit-trail export (signed PDF + JSON, WORM retention) | 2-3 weeks | P0 (regulated industries)
Real-time streaming eval on production traffic | 4-6 weeks | P1
CI/CD gate (block deploy on regression) | 1-2 weeks | P1
Cost-asymmetric scorer DSL | 2 weeks | P1
SME-rubric authoring UI + version control | 3-4 weeks | P1
Dual-judge with escalation workflow | 2 weeks | P1
Active-learning loop (human label → next eval set) | 4 weeks | P2
Multi-region deployment + data residency | 2-3 weeks | P2 (EU / regulated clients)
Vendor LLM ops platform integrations (LangSmith / W&B / Braintrust) | 1-2 weeks each | P2
SLO dashboard + PagerDuty integration | 2 weeks | P2

Total to enterprise-ready MVP: ~3-4 months of 2 engineers + 1 month design + ~$10K compliance audit prep.

See also

  • Architecture — the unchanged 5-component methodology that scales across all use cases
  • Implementation — the code that ships personal version; enterprise version extends adapters + adds governance, doesn’t rewrite the core
  • PRD — original problem framing; enterprise framing is a superset
  • Notes — 7 PM-bias catches that apply identically at enterprise scale (tune-before-verify, miss-baseline, refactor-early, bad-ROI-scale, author-context, PII-leak, blanket-generalize)