TL;DR: Knowledge-Audit = a tool that audits a KB itself (Personal-RAG ~50k sources). It detects contradictions across memory + CLAUDE.md + project NOTES. After an Eval-Framework bake-off: swap the Haiku verifier → Grok 4.3, real accuracy 80% → 99%, cost down. A real production swap in one afternoon.
Context — what is a RAG-KB (Retrieval-Augmented Generation Knowledge Base)?
My KB (~50k sources) sits behind Personal-RAG (a private personal RAG server). Sources:
- Personal memory files (~80 .md files)
- Workspace CLAUDE.md (4 files)
- Side-project NOTES.md / README.md / PRD.md (~50 files)
- Email + Slack + meetings (work)
- Confluence dump
- KB notes
→ Each day I add ~5-20 sources. Drift problem: memory yesterday claims X, project NOTES today claims Y — and it goes uncaught.
JTBD
When: I’m shipping a production feature and read memory to recall context.
I want to: ensure the KB has no silent contradictions (e.g., memory says “8GB RAM” but claude_md says “64GB”).
So that: decisions based on the KB aren’t poisoned by stale facts.
Existing solutions — and why they fail
- Manual review: 50k sources isn’t feasible
- Git diff across files: catches syntactic change, not semantic contradiction
- Single source of truth: a KB needs multiple perspectives → can’t enforce
- LLM scan over the whole KB: cost-prohibitive if run hourly
→ Solution: scheduled cross-source LLM audit — cost scales with volume, model picked via an eval-driven bake-off to optimize the accuracy/cost frontier.
Product hypothesis
3-layer audit:
- Layer 1: path_check + IP/port format (static, $0)
- Layer 2: LLM cross-source semantic contradiction (where decisions matter)
- Layer 3: probe (test paths, ping URLs)
Layer 2 = the quality + cost bottleneck. That’s where Eval-Framework applies.
MVP scope — Layer 2 specifically
| Component | Decision |
|---|---|
| LLM backend | Default = grok_43 (post-eval) |
| Trigger | 4 cron tiers (daily 03:01 / weekly Sat 21:00 / monthly 1st 22:00 / event UserPromptSubmit) |
| Output | _audit_report.md per run |
| Scoring | Internal LLM-judge (Sonnet) for production verification |
Eval-driven decision — Eval-Framework bake-off
Before Eval-Framework: assumed “Haiku is enough” (cheap, fast). After the bake-off:
| Backend | Real accuracy | Cost/mo @ SMB scale (15k audits/mo) |
|---|---|---|
| Haiku 4.5 alone | 13% (direct auditor) | ~$60 (low) — but accuracy unusable |
| Grok 4.3 + Sonnet 4.6 judge (production swap) | 99% verified holdout | ~$90 |
| Claude Sonnet 4.6 alone | 60% | ~$140 |
| Claude Opus 4.7 alone | 67% | ~$285 |
| Promptfoo + Inspect AI SaaS bundle | depends on config | $500-2,000+ |
→ Grok 4.3 = 99% beating premium models, build cost = 5-20× cheaper than SaaS audit tools.
(Scale assumption: SMB KB ~5,000 new sources/month → ~500 audits/day → 15,000 audits/mo. Personal scale is 100× smaller, still applicable but the real value surfaces at SMB scale.)
Build/buy decision
Build (vs buy a SaaS audit tool):
- Data residency / compliance: KB contains internal product docs + customer data + IP — can’t send to a 3rd-party vendor (GDPR / VN DPD / NDA constraint)
- Heterogeneous source formats: an enterprise KB mixes Confluence + Jira + Slack threads + email + meeting notes + PRDs — no SaaS audit tool is generic enough to parse all formats out-of-box
- Favorable cost-quality frontier: SMB scale (~15k audits/mo) builds for ~$90/mo via cloud LLM API vs SaaS ($500-2,000/mo subscription floor) = 5-20× cost gap, payback period 1-2 months
- Customization velocity: prompt + scoring rubric iterates in one afternoon, no dependency on a vendor roadmap
→ Build = strategic choice when (data sensitive) AND (source format bespoke) AND (volume meaningful). → Buy = better at large enterprise scale (>100k audits/mo) needing SLA + support, or when the team lacks ML engineering capacity.
Design principle: “Audit ≠ critic; audit = surface evidence”
3 finding levels:
- RED: directly contradicts another source with verbatim quote
- YELLOW: looks dated/unverified, worth a check
- (skipped): paraphrase / unrelated
Output JSON:
{
"memory_file": "personal_rag_hardware.md",
"level": "RED",
"type": "cross_source_mismatch",
"detail": "RAM differs: M2 Max 64GB (personal_rag_hardware) vs M1 16GB (CLAUDE.md)",
"evidence_chain": {
"primary_quote": "MBP M2 Max 64GB RAM",
"primary_file": "personal_rag_hardware.md",
"conflicting_quote": "M1 MacBook 16GB RAM",
"conflicting_file": "CLAUDE.md"
},
"confidence": 95
}
Audit = evidence-based finding. I decide how to reconcile.
Outcome measure — real audit run example
Saturday weekly audit of my KB → 7 findings in 1 run:
- Personal-RAG hardware specs (M2 vs M1 stale memory) ✓ confirmed real
- Inko port (8081 vs 9090 in different sources) ✓
- workspace count (6 vs 4 across memory files) ✓
- Python version for health-coach deployment (3.11 vs 3.10) ✓ confirmed 5-7. …
→ Memory drift = a real problem, and the audit catches it.
PM takeaways
- A RAG-KB isn’t automatically “correct” — it retrieves, but cross-source contradictions still exist
- An audit layer on top of RAG = essential for any long-lived KB
- Eval-driven model picks matter — don’t assume “premium model = best”
- Build-vs-Buy tipping point at SMB scale (~$90/mo build) vs SaaS audit tool ($500-2,000/mo) = 5-20× cost gap, build pays back in 1-2 months
Enterprise application — RAG-KB ops
The pattern is transferable for enterprise RAG:
- Compliance KB (banking): audit contradictions across policy doc versions
- Customer support KB: detect agent answer drift over time
- Engineering docs: catch stale runbook vs current infra
- Legal contract corpus: cross-clause contradiction detection
Cost scales linearly with KB size. Daily 100-file audit: ~$5/day = $150/mo. Cheap insurance for KB integrity at team scale.
Closing takeaways
3 takeaways any PM managing an LLM feature can use:
- Cross-source audit = a supplementary layer for RAG, not a replacement. RAG retrieves, audit verifies integrity.
- Eval-driven model picks > intuition. The “cheap default” can be wrong (Haiku 13% on my audit task).
- The build/buy framework needs 4 dimensions: data residency, source format heterogeneity, cost-quality frontier, customization velocity.
The pattern applies to any corpus that accumulates over time — not just a personal knowledge base.