Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3

A real production use case: build a cross-source contradiction detector on Personal-RAG. Verify ground truth with Eval-Framework, swap Haiku → Grok 4.3, real accuracy 80% → 99%.

TL;DR: Knowledge-Audit = a tool that audits a KB itself (Personal-RAG ~50k sources). It detects contradictions across memory + CLAUDE.md + project NOTES. After an Eval-Framework bake-off: swap the Haiku verifier → Grok 4.3, real accuracy 80% → 99%, cost down. A real production swap in one afternoon.

Context — what is a RAG-KB (Retrieval-Augmented Generation Knowledge Base)?

My KB (~50k sources) sits behind Personal-RAG (a private personal RAG server). Sources:

  • Personal memory files (~80 .md files)
  • Workspace CLAUDE.md (4 files)
  • Side-project NOTES.md / README.md / PRD.md (~50 files)
  • Email + Slack + meetings (work)
  • Confluence dump
  • KB notes

→ Each day I add ~5-20 sources. The drift problem: yesterday's memory claims X, today's project NOTES claim Y, and the mismatch goes uncaught.

JTBD

When: I’m shipping a production feature and read memory to recall context.

I want to: ensure the KB has no silent contradictions (e.g., memory says “8GB RAM” but claude_md says “64GB”).

So that: decisions based on the KB aren’t poisoned by stale facts.

Existing solutions — and why they fail

  • Manual review: not feasible at 50k sources
  • Git diff across files: catches syntactic change, not semantic contradiction
  • Single source of truth: can’t be enforced, because a KB needs multiple perspectives
  • LLM scan over the whole KB: cost-prohibitive if run hourly

→ Solution: a scheduled cross-source LLM audit. Cost scales with volume, and the model is picked via an eval-driven bake-off to find the best accuracy/cost trade-off.

Product hypothesis

3-layer audit:

  • Layer 1: path_check + IP/port format (static, $0)
  • Layer 2: LLM cross-source semantic contradiction (where decisions matter)
  • Layer 3: probe (test paths, ping URLs)

Layer 2 = the quality + cost bottleneck. That’s where Eval-Framework applies.
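
To make Layer 1 concrete, here is a minimal sketch of static checks, assuming memory files mention absolute POSIX paths and IPv4 `host:port` pairs inline. The function names and regexes are hypothetical illustrations, not the tool's actual implementation.

```python
import ipaddress
import re
from pathlib import Path

# Hypothetical patterns: absolute POSIX paths (not URL fragments) and IPv4:port pairs.
PATH_RE = re.compile(r"(?<![\w./:])(/(?:[\w.\-]+/)*[\w.\-]+)")
ENDPOINT_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def check_endpoints(text: str) -> list[str]:
    """Flag IPv4:port pairs whose address or port is out of range."""
    problems = []
    for ip, port in ENDPOINT_RE.findall(text):
        try:
            ipaddress.IPv4Address(ip)
        except ipaddress.AddressValueError:
            problems.append(f"invalid IP: {ip}")
        if not 1 <= int(port) <= 65535:
            problems.append(f"invalid port: {ip}:{port}")
    return problems


def check_paths(text: str) -> list[str]:
    """Flag absolute paths mentioned in the text that do not exist on disk."""
    return [f"missing path: {p}" for p in PATH_RE.findall(text) if not Path(p).exists()]
```

Both checks run locally with no model call, which is why this layer costs $0.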

MVP scope — Layer 2 specifically

| Component | Decision |
|---|---|
| LLM backend | Default = grok_43 (post-eval) |
| Trigger | 4 cron tiers (daily 03:01 / weekly Sat 21:00 / monthly 1st 22:00 / event UserPromptSubmit) |
| Output | _audit_report.md per run |
| Scoring | Internal LLM-judge (Sonnet) for production verification |
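
The three time-based trigger tiers can be sketched as a small scheduling check. This is an illustration only: `due_tiers` is a hypothetical helper, and the event tier (UserPromptSubmit) is assumed to be wired up separately since it is not clock-driven.

```python
from datetime import datetime


def due_tiers(now: datetime) -> list[str]:
    """Return the scheduled audit tiers that fire at this minute."""
    tiers = []
    if (now.hour, now.minute) == (3, 1):
        tiers.append("daily")
    # datetime.weekday(): Monday is 0, so Saturday is 5.
    if now.weekday() == 5 and (now.hour, now.minute) == (21, 0):
        tiers.append("weekly")
    if now.day == 1 and (now.hour, now.minute) == (22, 0):
        tiers.append("monthly")
    return tiers
```

In a real deployment the same schedule would more likely live in three crontab entries; the function form just makes the tier rules explicit and testable.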

Eval-driven decision — Eval-Framework bake-off

Before Eval-Framework, I assumed “Haiku is enough” (cheap, fast). The bake-off said otherwise:

| Backend | Real accuracy | Cost/mo @ SMB scale (15k audits/mo) |
|---|---|---|
| Haiku 4.5 alone | 13% (direct auditor) | ~$60 (low), but accuracy unusable |
| Grok 4.3 + Sonnet 4.6 judge (production swap) | 99% verified holdout | ~$90 |
| Claude Sonnet 4.6 alone | 60% | ~$140 |
| Claude Opus 4.7 alone | 67% | ~$285 |
| Promptfoo + Inspect AI SaaS bundle | depends on config | $500-2,000+ |

Grok 4.3 with the Sonnet judge reaches 99%, beating the premium models run alone, and the build costs roughly 5-20× less than SaaS audit tools.
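
For readers who want to reproduce this kind of comparison, a bake-off reduces to scoring each backend against a labeled holdout set. The sketch below assumes each backend is a callable that takes two claims and returns a finding level; the `Case` shape, `accuracy`, and `bake_off` are hypothetical names, not Eval-Framework's API.

```python
from typing import Callable

# A labeled case: (claim_a, claim_b, expected_level). The shape is an assumption.
Case = tuple[str, str, str]
Backend = Callable[[str, str], str]


def accuracy(backend: Backend, holdout: list[Case]) -> float:
    """Fraction of holdout cases where the backend returns the expected level."""
    if not holdout:
        raise ValueError("holdout set is empty")
    hits = sum(backend(a, b) == expected for a, b, expected in holdout)
    return hits / len(holdout)


def bake_off(backends: dict[str, Backend], holdout: list[Case]) -> list[tuple[str, float]]:
    """Rank backends by holdout accuracy, best first."""
    scores = [(name, accuracy(fn, holdout)) for name, fn in backends.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Wrapping each model call in the same `Backend` signature is what makes a swap cheap: the harness and the holdout stay fixed while only the callable changes.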

(Scale assumption: SMB KB ~5,000 new sources/month → ~500 audits/day → 15,000 audits/mo. Personal scale is 100× smaller, still applicable but the real value surfaces at SMB scale.)
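
The arithmetic behind these figures, worked through as a quick check. The 30-day month is an assumption; the dollar amounts are the ones quoted above.

```python
AUDITS_PER_DAY = 500
DAYS_PER_MONTH = 30              # assumption: a flat 30-day month
BUILD_COST_PER_MONTH = 90        # USD, Grok 4.3 + Sonnet 4.6 judge
SAAS_COST_RANGE = (500, 2000)    # USD per month, SaaS bundle

audits_per_month = AUDITS_PER_DAY * DAYS_PER_MONTH
cost_per_audit = BUILD_COST_PER_MONTH / audits_per_month
cost_gap = tuple(round(saas / BUILD_COST_PER_MONTH, 1) for saas in SAAS_COST_RANGE)

print(audits_per_month)          # 15000
print(f"{cost_per_audit:.3f}")   # 0.006
print(cost_gap)                  # (5.6, 22.2)
```

Under these figures the build works out to about $0.006 per audit, and the gap to SaaS is about 5.6× to 22.2×, consistent with the rounded 5-20× claim.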

Build/buy decision

Build (vs buy a SaaS audit tool):

  • Data residency / compliance: KB contains internal product docs + customer data + IP — can’t send to a 3rd-party vendor (GDPR / VN DPD / NDA constraint)
  • Heterogeneous source formats: an enterprise KB mixes Confluence + Jira + Slack threads + email + meeting notes + PRDs — no SaaS audit tool is generic enough to parse all formats out-of-box
  • Favorable cost-quality frontier: at SMB scale (~15k audits/mo) a build runs ~$90/mo via cloud LLM APIs vs a SaaS subscription floor of $500-2,000/mo, a 5-20× cost gap with a payback period of 1-2 months
  • Customization velocity: prompt + scoring rubric iterates in one afternoon, no dependency on a vendor roadmap

→ Build = strategic choice when (data sensitive) AND (source format bespoke) AND (volume meaningful).

→ Buy = better at large enterprise scale (>100k audits/mo) needing SLA + support, or when the team lacks ML engineering capacity.

Design principle: “Audit ≠ critic; audit = surface evidence”

3 finding levels:

  • RED: directly contradicts another source, backed by a verbatim quote from each
  • YELLOW: looks dated/unverified, worth a check
  • (skipped): paraphrase / unrelated

Output JSON:

{
  "memory_file": "personal_rag_hardware.md",
  "level": "RED",
  "type": "cross_source_mismatch",
  "detail": "RAM differs: M2 Max 64GB (personal_rag_hardware) vs M1 16GB (CLAUDE.md)",
  "evidence_chain": {
    "primary_quote": "MBP M2 Max 64GB RAM",
    "primary_file": "personal_rag_hardware.md",
    "conflicting_quote": "M1 MacBook 16GB RAM",
    "conflicting_file": "CLAUDE.md"
  },
  "confidence": 95
}

Audit = evidence-based finding. I decide how to reconcile.
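
One way to hold findings to the "surface evidence" principle is to confirm that both quotes in `evidence_chain` really appear in the files they cite. The sketch below assumes `sources` maps file names to their full text; `verify_evidence`, `triage`, and the downgrade-to-YELLOW policy are hypothetical, not the tool's documented behavior.

```python
def verify_evidence(finding: dict, sources: dict[str, str]) -> bool:
    """True when both quotes in the evidence chain appear verbatim in their files."""
    chain = finding["evidence_chain"]
    pairs = [
        (chain["primary_file"], chain["primary_quote"]),
        (chain["conflicting_file"], chain["conflicting_quote"]),
    ]
    return all(quote in sources.get(path, "") for path, quote in pairs)


def triage(finding: dict, sources: dict[str, str]) -> str:
    """Keep RED only when the evidence is verbatim; otherwise downgrade to YELLOW."""
    if finding["level"] == "RED" and not verify_evidence(finding, sources):
        return "YELLOW"
    return finding["level"]
```

A plain substring check is deliberately strict: a paraphrased or hallucinated quote fails it, so the finding drops to "worth a check" rather than being presented as a proven contradiction.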

Outcome measure — real audit run example

Saturday weekly audit of my KB → 7 findings in 1 run:

  1. Personal-RAG hardware specs (M2 vs M1 stale memory) ✓ confirmed real
  2. Inko port (8081 vs 9090 in different sources) ✓
  3. workspace count (6 vs 4 across memory files) ✓
  4. Python version for health-coach deployment (3.11 vs 3.10) ✓ confirmed
  5-7. …

Memory drift = a real problem, and the audit catches it.

PM takeaways

  1. A RAG-KB isn’t automatically “correct” — it retrieves, but cross-source contradictions still exist
  2. An audit layer on top of RAG = essential for any long-lived KB
  3. Eval-driven model picks matter — don’t assume “premium model = best”
  4. The build-vs-buy tipping point sits at SMB scale: ~$90/mo to build vs $500-2,000/mo for a SaaS audit tool, a 5-20× cost gap that pays back the build in 1-2 months

Enterprise application — RAG-KB ops

The pattern transfers to enterprise RAG:

  • Compliance KB (banking): audit contradictions across policy doc versions
  • Customer support KB: detect agent answer drift over time
  • Engineering docs: catch stale runbook vs current infra
  • Legal contract corpus: cross-clause contradiction detection

Cost scales linearly with KB size. Daily 100-file audit: ~$5/day = $150/mo. Cheap insurance for KB integrity at team scale.

Closing takeaways

3 takeaways any PM managing an LLM feature can use:

  1. Cross-source audit = a supplementary layer for RAG, not a replacement. RAG retrieves, audit verifies integrity.
  2. Eval-driven model picks > intuition. The “cheap default” can be wrong (Haiku 13% on my audit task).
  3. The build/buy framework needs 4 dimensions: data residency, source format heterogeneity, cost-quality frontier, customization velocity.

The pattern applies to any corpus that accumulates over time — not just a personal knowledge base.