Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3

TL;DR: Knowledge-Audit is a tool that audits the KB itself (Personal-RAG, ~50k sources). It detects contradictions across memory files, CLAUDE.md, and project NOTES. After an Eval-Framework bake-off, I swapped the Haiku verifier for Grok 4.3: real accuracy 80% → 99%, cost down. A real production swap, done in one afternoon.

Context — what is a RAG-KB (Retrieval-Augmented Generation Knowledge Base)?

My KB (~50k sources) sits behind Personal-RAG (a private personal RAG server). Sources:

Personal memory files (~80 .md files)
Workspace CLAUDE.md (4 files)
Side-project NOTES.md / README.md / PRD.md (~50 files)
Email + Slack + meetings (work)
Confluence dump
KB notes

→ Each day I add ~5-20 sources. The drift problem: yesterday's memory file claims X, today's project NOTES claims Y, and nothing catches the mismatch.

JTBD

When: I’m shipping a production feature and read memory to recall context.

I want to: ensure the KB has no silent contradictions (e.g., memory says “8GB RAM” but claude_md says “64GB”).

So that: decisions based on the KB aren’t poisoned by stale facts.

Existing solutions — and why they fail

Manual review: not feasible at 50k sources
Git diff across files: catches syntactic changes, not semantic contradictions
Single source of truth: a KB needs multiple perspectives, so it can't be enforced
LLM scan over the whole KB: cost-prohibitive if run hourly

→ Solution: a scheduled cross-source LLM audit. Cost scales with audit volume, and the model is picked through an eval-driven bake-off to find the best point on the accuracy/cost frontier.

Product hypothesis

3-layer audit:

Layer 1: path_check + IP/port format (static, $0)
Layer 2: LLM cross-source semantic contradiction (where decisions matter)
Layer 3: probe (test paths, ping URLs)
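
Layer 1 needs nothing beyond the standard library. Here is a minimal sketch of what such a static pass could look like. It is not the tool's actual code: the regexes, the function name `static_findings`, and the finding strings are all assumptions made for illustration.

```python
import ipaddress
import re
from pathlib import Path

# Hypothetical patterns; the real tool's rules are not shown in this post.
# Matches dotted-quad candidates with an optional :port suffix.
IP_PORT_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})(?::(\d{1,5}))?\b")
# Matches absolute POSIX-style paths, skipping the path part of URLs.
PATH_RE = re.compile(r"(?<![\w./:])(/(?:[\w.\-]+/)*[\w.\-]+)")


def static_findings(text: str, check_paths: bool = True) -> list[str]:
    """Return human-readable Layer 1 findings for one source's text."""
    findings = []
    for match in IP_PORT_RE.finditer(text):
        ip, port = match.group(1), match.group(2)
        try:
            ipaddress.IPv4Address(ip)  # rejects octets above 255
        except ipaddress.AddressValueError:
            findings.append(f"invalid IPv4 address: {ip}")
        if port is not None and not 1 <= int(port) <= 65535:
            findings.append(f"port out of range: {ip}:{port}")
    if check_paths:
        for path in PATH_RE.findall(text):
            if not Path(path).exists():
                findings.append(f"path not found: {path}")
    return findings
```

No model is called, so this pass stays at $0 per run. A path followed directly by a sentence-ending period would be matched with the period included, so a real version would need tighter path rules.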

Layer 2 = the quality + cost bottleneck. That’s where Eval-Framework applies.
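
To make the Layer 2 call concrete, here is a rough sketch of its shape. The prompt wording, the function names, and the `call_llm` callable are assumptions: the backend is injected as a plain function so the sketch does not depend on any vendor SDK.

```python
import json
from typing import Callable

# Hypothetical prompt; the production prompt is not shown in this post.
PROMPT_TEMPLATE = """You audit a knowledge base for contradictions.
Primary source ({primary_name}):
{primary_text}

Other sources:
{others}

Return a JSON list of findings. Each finding needs: level (RED or YELLOW),
detail, primary_quote, conflicting_quote, conflicting_file.
Quote verbatim. Skip paraphrases and unrelated text."""


def build_prompt(primary_name: str, primary_text: str, others: dict[str, str]) -> str:
    """Put one primary source and its comparison sources into a single prompt."""
    others_block = "\n\n".join(f"[{name}]\n{text}" for name, text in others.items())
    return PROMPT_TEMPLATE.format(
        primary_name=primary_name, primary_text=primary_text, others=others_block
    )


def audit_sources(
    primary_name: str,
    primary_text: str,
    others: dict[str, str],
    call_llm: Callable[[str], str],
) -> list[dict]:
    """Run one cross-source check and keep only well-formed RED/YELLOW findings."""
    raw = call_llm(build_prompt(primary_name, primary_text, others))
    try:
        findings = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed model output counts as "no findings"
    if not isinstance(findings, list):
        return []
    return [
        f for f in findings
        if isinstance(f, dict) and f.get("level") in {"RED", "YELLOW"}
    ]
```

Because the backend is just a callable, swapping one model for another means changing one argument, which is what makes a bake-off cheap to run.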

MVP scope — Layer 2 specifically

Component	Decision
LLM backend	Default = `grok_43` (post-eval)
Trigger	4 tiers: 3 cron schedules (daily 03:01 / weekly Sat 21:00 / monthly 1st 22:00) + 1 event hook (UserPromptSubmit)
Output	`_audit_report.md` per run
Scoring	Internal LLM-judge (Sonnet) for production verification
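
The three time-based tiers in the table reduce to a small schedule check. The sketch below assumes a hypothetical `due_tiers` helper called once per minute; the event tier (UserPromptSubmit) fires from a hook rather than the clock, so it is left out.

```python
from datetime import datetime


def due_tiers(now: datetime) -> list[str]:
    """Return the time-based audit tiers that should fire at this minute."""
    tiers = []
    hour_minute = (now.hour, now.minute)
    if hour_minute == (3, 1):
        tiers.append("daily")      # crontab equivalent: 1 3 * * *
    if now.weekday() == 5 and hour_minute == (21, 0):
        tiers.append("weekly")     # Saturday; crontab equivalent: 0 21 * * 6
    if now.day == 1 and hour_minute == (22, 0):
        tiers.append("monthly")    # crontab equivalent: 0 22 1 * *
    return tiers
```

The off-the-hour start times mean the tiers never collide, even when the 1st of the month falls on a Saturday.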

Eval-driven decision — Eval-Framework bake-off

Before Eval-Framework, I assumed “Haiku is enough” (cheap, fast). The bake-off said otherwise:

Backend	Real accuracy	Cost/mo @ SMB scale (15k audits/mo)
Haiku 4.5 alone	13% (direct auditor)	~$60 (low) — but accuracy unusable
Grok 4.3 + Sonnet 4.6 judge (production swap)	99% verified holdout	~$90
Claude Sonnet 4.6 alone	60%	~$140
Claude Opus 4.7 alone	67%	~$285
Promptfoo + Inspect AI SaaS bundle	depends on config	$500-2,000+

→ Grok 4.3 with a Sonnet 4.6 judge reaches 99% on the verified holdout, ahead of the premium models run alone, at a build cost 5-20× lower than SaaS audit tools.
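
For readers who have not run a bake-off, the accuracy column comes down to comparing each backend's labels against hand-verified ones. A minimal sketch, assuming a hypothetical holdout keyed by case ID; the toy labels in the usage note are illustrative and are not the eval set behind the table.

```python
def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Share of holdout cases where the backend's label matches the verified label."""
    if not gold:
        raise ValueError("holdout set is empty")
    correct = sum(
        1 for case_id, label in gold.items() if predictions.get(case_id) == label
    )
    return correct / len(gold)


def rank_backends(
    runs: dict[str, dict[str, str]], gold: dict[str, str]
) -> list[tuple[str, float]]:
    """Score every backend on the same holdout and sort best-first."""
    scores = {name: accuracy(preds, gold) for name, preds in runs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With a toy holdout of four cases, `rank_backends({"backend_a": {...}, "backend_b": {...}}, gold)` returns the backends ordered by score. A missing prediction counts as wrong, so a backend that fails to answer is penalized rather than skipped.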

(Scale assumption: SMB KB ~5,000 new sources/month → ~500 audits/day → 15,000 audits/mo. Personal scale is 100× smaller, still applicable but the real value surfaces at SMB scale.)
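
The scale arithmetic is worth spelling out, since the cost comparison depends on it. The inputs below are the figures quoted in this post plus an assumed 30-day month; everything else is derived.

```python
# Inputs taken from the post; DAYS_PER_MONTH is an assumption.
NEW_SOURCES_PER_MONTH = 5_000
AUDITS_PER_DAY = 500
DAYS_PER_MONTH = 30
BUILD_COST_PER_MONTH = 90.0          # Grok 4.3 + Sonnet 4.6 judge
SAAS_COST_RANGE = (500.0, 2_000.0)   # SaaS bundle, low and high end

audits_per_month = AUDITS_PER_DAY * DAYS_PER_MONTH              # 15,000
audits_per_source = audits_per_month / NEW_SOURCES_PER_MONTH    # 3.0
cost_per_audit = BUILD_COST_PER_MONTH / audits_per_month        # $0.006
gap_low = SAAS_COST_RANGE[0] / BUILD_COST_PER_MONTH             # about 5.6x
gap_high = SAAS_COST_RANGE[1] / BUILD_COST_PER_MONTH            # about 22.2x

print(f"{audits_per_month:,} audits/mo, {audits_per_source:.0f} audits per new source")
print(f"${cost_per_audit:.3f} per audit, SaaS gap {gap_low:.1f}x to {gap_high:.1f}x")
```

So the quoted 5-20× gap is a rounded version of roughly 5.6× to 22×, and the scale assumption implies each new source gets audited about three times in its first month.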

Build/buy decision

Build (vs buy a SaaS audit tool):

Data residency / compliance: KB contains internal product docs + customer data + IP — can’t send to a 3rd-party vendor (GDPR / VN DPD / NDA constraint)
Heterogeneous source formats: an enterprise KB mixes Confluence + Jira + Slack threads + email + meeting notes + PRDs — no SaaS audit tool is generic enough to parse all formats out-of-box
Favorable cost-quality frontier: at SMB scale (~15k audits/mo) a build runs ~$90/mo on cloud LLM APIs vs a $500-2,000/mo SaaS subscription floor, a 5-20× cost gap with a 1-2 month payback period
Customization velocity: prompt + scoring rubric iterates in one afternoon, no dependency on a vendor roadmap

→ Build = strategic choice when (data sensitive) AND (source format bespoke) AND (volume meaningful).

→ Buy = better at large enterprise scale (>100k audits/mo) needing SLA + support, or when the team lacks ML engineering capacity.
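
The build/buy rule reads naturally as a small decision function. This is a sketch only: `KBProfile`, `recommend`, and the `MEANINGFUL_VOLUME` threshold are hypothetical. The post never defines "meaningful" volume, so the SMB figure of 15k audits/mo stands in for it.

```python
from dataclasses import dataclass

MEANINGFUL_VOLUME = 15_000    # assumption: the SMB scale used in this post
ENTERPRISE_VOLUME = 100_000   # from the post: buy tends to win above this


@dataclass
class KBProfile:
    data_sensitive: bool
    bespoke_formats: bool
    audits_per_month: int
    needs_sla: bool = False
    has_ml_capacity: bool = True


def recommend(profile: KBProfile) -> str:
    """Return 'build', 'buy', or 'unclear' following the rule of thumb above."""
    if not profile.has_ml_capacity:
        return "buy"
    if profile.audits_per_month > ENTERPRISE_VOLUME and profile.needs_sla:
        return "buy"
    if (
        profile.data_sensitive
        and profile.bespoke_formats
        and profile.audits_per_month >= MEANINGFUL_VOLUME
    ):
        return "build"
    return "unclear"
```

Anything that matches neither rule comes back as "unclear" instead of being forced into one bucket, because the rule of thumb does not cover every case.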

Design principle: “Audit ≠ critic; audit = surface evidence”

3 finding levels:

RED: directly contradicts another source, backed by a verbatim quote
YELLOW: looks dated/unverified, worth a check
(skipped): paraphrase / unrelated

Output JSON:

{
  "memory_file": "personal_rag_hardware.md",
  "level": "RED",
  "type": "cross_source_mismatch",
  "detail": "RAM differs: M2 Max 64GB (personal_rag_hardware) vs M1 16GB (CLAUDE.md)",
  "evidence_chain": {
    "primary_quote": "MBP M2 Max 64GB RAM",
    "primary_file": "personal_rag_hardware.md",
    "conflicting_quote": "M1 MacBook 16GB RAM",
    "conflicting_file": "CLAUDE.md"
  },
  "confidence": 95
}

Audit = evidence-based finding. I decide how to reconcile.
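
Since RED hinges on verbatim quotes, the evidence chain can be checked mechanically before a finding reaches the report. A minimal sketch, assuming the JSON shape above and a hypothetical `sources` dict mapping file names to their text. Downgrading an unverifiable RED to YELLOW is my assumption for this sketch, not a documented rule of the tool.

```python
def quote_in_file(quote: str, file_name: str, sources: dict[str, str]) -> bool:
    """True when a non-empty quote appears character-for-character in the named file."""
    return bool(quote) and quote in sources.get(file_name, "")


def quotes_are_verbatim(finding: dict, sources: dict[str, str]) -> bool:
    """Check both sides of the evidence chain against the actual source text."""
    chain = finding.get("evidence_chain", {})
    return quote_in_file(
        chain.get("primary_quote", ""), chain.get("primary_file", ""), sources
    ) and quote_in_file(
        chain.get("conflicting_quote", ""), chain.get("conflicting_file", ""), sources
    )


def effective_level(finding: dict, sources: dict[str, str]) -> str:
    """Keep RED only when its quotes check out; otherwise downgrade (assumed policy)."""
    level = finding.get("level", "YELLOW")
    if level == "RED" and not quotes_are_verbatim(finding, sources):
        return "YELLOW"
    return level
```

This guards against a model that paraphrases or invents a quote: the finding still surfaces, but only as something worth a check.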

Outcome measure — real audit run example

Saturday weekly audit of my KB → 7 findings in 1 run:

Personal-RAG hardware specs (M2 vs M1 stale memory) ✓ confirmed real
Inko port (8081 vs 9090 in different sources) ✓
workspace count (6 vs 4 across memory files) ✓
Python version for health-coach deployment (3.11 vs 3.10) ✓ confirmed
5-7. …

→ Memory drift = a real problem, and the audit catches it.

PM takeaways

A RAG-KB isn’t automatically “correct” — it retrieves, but cross-source contradictions still exist
An audit layer on top of RAG = essential for any long-lived KB
Eval-driven model picks matter — don’t assume “premium model = best”
The build-vs-buy tipping point sits at SMB scale: ~$90/mo to build vs $500-2,000/mo for a SaaS audit tool, a 5-20× cost gap, so the build pays back in 1-2 months

Enterprise application — RAG-KB ops

The pattern transfers to enterprise RAG:

Compliance KB (banking): audit contradictions across policy doc versions
Customer support KB: detect agent answer drift over time
Engineering docs: catch stale runbook vs current infra
Legal contract corpus: cross-clause contradiction detection

Cost scales linearly with KB size. Daily 100-file audit: ~$5/day = $150/mo. Cheap insurance for KB integrity at team scale.
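
The linear-scaling claim can be turned into a quick estimator. The per-file rate below is derived from the ~$5/day for 100 files quoted above; the `monthly_cost` helper and the 30-day month are assumptions.

```python
# Derived from the post: ~$5/day for a 100-file daily audit.
COST_PER_FILE = 5.0 / 100  # about $0.05 per audited file


def monthly_cost(files_per_day: int, days: int = 30) -> float:
    """Estimate monthly audit spend, assuming cost grows linearly with file count."""
    return files_per_day * COST_PER_FILE * days


print(f"100 files/day: ${monthly_cost(100):.0f}/mo")
print(f"500 files/day: ${monthly_cost(500):.0f}/mo")
```

Linear scaling is itself a simplification: it ignores prompt length, so a KB with unusually long documents would cost more per file than this rate suggests.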

Closing takeaways

3 takeaways any PM managing an LLM feature can use:

Cross-source audit = a supplementary layer for RAG, not a replacement. RAG retrieves, audit verifies integrity.
Eval-driven model picks > intuition. The “cheap default” can be wrong (Haiku scored 13% on my audit task).
The build/buy framework needs 4 dimensions: data residency, source format heterogeneity, cost-quality frontier, customization velocity.

The pattern applies to any corpus that accumulates over time — not just a personal knowledge base.