Knowledge-Audit — PRD
Size M · P0 · Foundation

Status: ✅ Shipped (day-one deploy 2025-11-08, production judge swap 2026-05-23)

Build time: ~8 hours initial + ~5 hours Grok swap (one afternoon)
1. Problem
Persistent AI memory drifts. Sources I write today (memory files, workspace CLAUDE.md, project NOTES, meeting transcripts, KB notes) age into stale facts within weeks. The LLM that reads them treats every line as ground truth, has no notion of recency, and doesn’t know what it doesn’t know.
The triggering incident: the AI confidently asserted I was running personal infrastructure on an Oracle Cloud A1 VM (ARM free tier), citing the workspace CLAUDE.md. The VM was never registered; it was a plan from February that I had filed under “OCI Foundation”. A separate memory file (mail_watcher_project.md) noted “couldn’t register A1”, but the AI never cross-referenced the two. I had been acting on a hallucination for ~6 months.
Pain: undetected contradictions in a long-lived KB poison every downstream decision. With 50+ memory files, 12 side projects, 4 workspace CLAUDE.md files, and ~100 meeting transcripts, drift is inevitable.
Why now: Personal-RAG shipped with 42k+ sources and growing. Daily auto-ingest means new contradictions enter the corpus every week. Without active validation, trust in the KB collapses within a quarter.
2. Goal & Success Metrics
Goal: every Saturday morning, see a Telegram digest of contradictions and stale facts in my KB — severity-tagged, evidence-quoted, actionable in <2 minutes per finding.
Metrics (target vs. achieved):
| Metric | Target | Achieved | Note |
|---|---|---|---|
| Findings caught in day-1 deploy | ≥10 | 25+ | 79-file corpus |
| Real accuracy (LLM-judged on holdout) | ≥90% | 99% | After Grok 4.3 swap, Eval-Framework-verified |
| Strict-substring accuracy | n/a | 80% | Pre-LLM-judged — bake-off revealed scoring trap |
| Cost / month | <$10 | ~$5 | All 4 tiers combined |
| Audit cadence | weekly | 4 tiers (daily / weekly / monthly / event) | Plus post-commit hook |
| Time-to-triage 1 finding | <5 min | <2 min | ADHD digest format |
| False positive rate | <20% | ~12% | Measured on 50-finding sample |
3. User journey
- A sync agent (Personal-RAG mount-watcher, meeting transcriber, AI session hook) writes a `.md` file into the KB mount.
- Daily 03:01 local: light audit (4B local model + Haiku 4.5 verifier) over files touched in the last 24h.
- Weekly Saturday 21:00: full audit (8B + Haiku 4.5 verifier) across the entire personal corpus.
- Monthly 1st 22:00: deeper audit (32B alone, longer context windows) including cross-workspace contradictions.
- Event-triggered: a post-commit hook on the KB-s3 mount fires a scoped audit on the changed files within minutes.
- Findings → Grok 4.3 judge verification → severity tag 🟢🟡🔴 → Telegram digest, bundled 2× per day.
- For weekly/monthly: auto-fix candidates pass safety heuristics + Grok 4.3 judge → git snapshot → apply (review-and-merge gate retained for high-risk types).
- User triages in the Telegram thread: ✅ accept / ⏸ snooze / ❌ reject.
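The tier split in the journey above can be written down as a small lookup table. This is a minimal sketch, not the shipped code: the `Tier` type, the model labels, and `files_in_daily_scope` are hypothetical names chosen for illustration, and the event-triggered tier is left out because this document does not say which model it uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    detector: str             # local detection model (label is illustrative)
    verifier: Optional[str]   # None means the detector runs alone
    scope: str


# Values taken from the user journey; identifiers are assumptions.
TIERS = {
    "daily": Tier("daily", "qwen-4b", "haiku-4.5", "files touched in the last 24h"),
    "weekly": Tier("weekly", "qwen-8b", "haiku-4.5", "entire personal corpus"),
    "monthly": Tier("monthly", "qwen-32b", None, "cross-workspace contradictions"),
}


def files_in_daily_scope(mtimes: dict[str, float], now: float) -> list[str]:
    """Return paths whose modification time falls within the last 24 hours."""
    cutoff = now - 24 * 3600
    return sorted(path for path, mtime in mtimes.items() if mtime >= cutoff)
```

Keeping the tiers in one table makes it easy to see that only the monthly tier runs without a verifier, which is the false-negative risk accepted later in this document.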
4. Scope (MoSCoW) — final
Must — DONE:
- ✅ 3-layer audit (intra-file / cross-file in workspace / cross-workspace)
- ✅ 4-tier cron via launchd (daily / weekly / monthly / event-triggered)
- ✅ Cross-source contradiction detection with evidence-chain output (verbatim quotes)
- ✅ Severity tagging (🟢🟡🔴)
- ✅ Telegram delivery with ADHD-friendly formatting
- ✅ PII redaction before any LLM call (~12 regex patterns + email partial-redact)
- ✅ Production LLM judge (Grok 4.3, swapped from Haiku 4.5 after eval)
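The PII redaction item above (regex patterns plus email partial-redact) could look roughly like the following. The two patterns and the placeholder strings are hypothetical examples; the actual ~12 patterns are not listed in this document.

```python
import re

# Illustrative patterns only; the real pattern set is not specified here.
PATTERNS = [
    (re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"), "[REDACTED_API_KEY]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
]

# Keep the first character of the local part and the domain, mask the rest.
EMAIL = re.compile(
    r"\b([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b"
)


def redact(text: str) -> str:
    """Apply full redaction patterns first, then partially redact emails."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return EMAIL.sub(r"\1***@\2", text)
```

Partial email redaction keeps enough of the address for a finding to stay readable in the digest while withholding the full local part from the LLM.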
Should — DONE:
- ✅ Re-uses Personal-RAG retrieval (bge-m3 + Postgres) — no duplicate index
- ✅ Catch-up logic (launchd `RunAtLoad: true` + skip-if-recent state file) for Mac off/asleep
- ✅ Fingerprint-based suppression with expiry (deprecated mid-2026 in favor of in-source explanatory comments; see Notes)
- ✅ Auto-fix policy (Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply) for weekly/monthly
- ✅ Manual probe registry (~23 entries for project-specific facts)
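The skip-if-recent half of the catch-up logic above can be sketched as a timestamp check against a state file. This assumes a JSON state file with a `last_run` key; the file layout and the function names are hypothetical, not taken from the implementation.

```python
import json
import time
from pathlib import Path
from typing import Optional


def should_run(state_file: Path, min_interval_s: float, now: Optional[float] = None) -> bool:
    """Return True if no run is recorded or the last run is older than the interval."""
    now = time.time() if now is None else now
    try:
        last_run = json.loads(state_file.read_text())["last_run"]
    except (FileNotFoundError, KeyError, TypeError, json.JSONDecodeError):
        return True  # missing or unreadable state: run rather than skip
    return now - last_run >= min_interval_s


def mark_run(state_file: Path, now: Optional[float] = None) -> None:
    """Record the completion time of the current run."""
    now = time.time() if now is None else now
    state_file.write_text(json.dumps({"last_run": now}))
```

With `RunAtLoad: true`, launchd starts the job whenever the agent loads, so a check like this is what keeps a reboot shortly after a completed audit from triggering a duplicate run.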
Could — partial:
- ⏸️ Reverse audit (scan codebase for facts that should be in memory)
- ⏸️ Adversarial synthetic test corpus with known answers (precision/recall benchmark)
- ✅ Drift trend dashboard — minimal version in Telegram digest (week-over-week RED count)
Won’t (M1) — kept:
- Multi-user / team mode
- Web UI — Telegram is sufficient
- Real-time per-keystroke validation — overkill
5. Architecture (final)
3 detection layers × 4 trigger tiers × 1 judge layer × 1 auto-fix pipeline. See Architecture for diagrams.
6. Tech Stack — final choices
| Layer | Original spec | Implemented | Reason for change |
|---|---|---|---|
| Detection model (light) | Haiku 4.5 | Qwen 4B local + Haiku 4.5 verifier | Local model handles bulk, Haiku verifies only borderline cases — 4× cheaper |
| Detection model (heavy) | Sonnet 4.6 | Qwen 8B local + Haiku 4.5 verifier (weekly), Qwen 32B alone (monthly) | Same logic, scaled by tier |
| Production judge | Haiku 4.5 | Grok 4.3 (xAI) | Eval-Framework bake-off 2026-05-23: Haiku 13% accuracy on audit task, Grok 99% verified on holdout-99 |
| Scoring | Strict substring | LLM-judge (Sonnet 4.6) for production verification | Strict substring counted alternate valid findings as fail (80% → 99% true accuracy when LLM-judged) |
| Retrieval | n/a — read files raw | Re-use Personal-RAG (bge-m3 + pgvector) | Avoid duplicate index, get multilingual retrieval for free |
| Trigger | systemd cron | launchd (RunAtLoad: true + skip-if-recent state file) | Mac off/asleep catch-up |
| Delivery | Telegram Bot API | ADHD-friendly inline triage (✅/⏸/❌) | |
| Auto-fix gate | none | Safety heuristics + Grok 4.3 judge + git snapshot | Reversibility for “apply automatically” tier |
| Runtime host | Cloud VM | Local MacBook Pro M2 Max | $0 cloud cost, low latency, runs alongside Personal-RAG |
Cost posture: ~$5/month at current cadence (daily + weekly + monthly + event-driven). Grok 4.3 judge is the dominant cost line at ~$0.61/mo per Eval-Framework benchmark.
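The scoring change in the table above is easier to see with a toy case. The sketch below is an illustration under stated assumptions: the example strings are invented, and `stub_judge` is a hard-coded stand-in for a real LLM judge call, which this document does not specify.

```python
from typing import Callable


def strict_substring_pass(expected: str, model_output: str) -> bool:
    """Pass only if the expected phrase appears verbatim (case-insensitive)."""
    return expected.lower() in model_output.lower()


def judged_pass(expected: str, model_output: str,
                judge: Callable[[str, str], bool]) -> bool:
    """Delegate to a judge that decides whether the output is a valid finding."""
    return judge(expected, model_output)


def accuracy(passes: list[bool]) -> float:
    """Fraction of passing cases; 0.0 for an empty list."""
    return sum(passes) / len(passes) if passes else 0.0


# Invented example: a correct finding phrased differently from the label.
expected = "A1 VM was never registered"
output = "CLAUDE.md claims an Oracle A1 VM is in use, but registration never succeeded."


def stub_judge(expected: str, model_output: str) -> bool:
    # Stand-in for an LLM call; a real judge would compare meaning, not strings.
    return "registration never succeeded" in model_output
```

Here `strict_substring_pass(expected, output)` fails while `judged_pass(expected, output, stub_judge)` passes, which is the kind of trap that makes strict scoring understate accuracy.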
7. Milestones — actual
| Date | What shipped |
|---|---|
| 2025-11-08 | Initial deploy — 3 layers, weekly cron, Telegram digest. Day-one corpus: 79 files, 25+ findings |
| 2025-11-09 | Hardening pass — PII redaction (12 regex patterns), launchd catch-up logic, fingerprint suppression |
| 2025-11-15 | 4-tier cron (daily / weekly / monthly / event) shipped |
| 2026-02-x | Post-commit hook on KB-s3 mount (event-triggered tier) |
| 2026-05-20 | Audit corpus profile measured: VN 21% / mixed 28% / EN 40% → confirms Qwen dense primary, 4B+Haiku sweet spot |
| 2026-05-23 | Production judge swap Haiku → Grok 4.3. Real accuracy 80% (strict) → 99% (LLM-judged) on Eval-Framework holdout-99. Ship cost $0.61/mo |
| 2026-05-24 | Auto-fix policy live: Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply. Telegram ADHD format finalized |
8. Cost & Quota
| Item | Cost | Note |
|---|---|---|
| Qwen 4B / 8B / 32B local inference | $0 | Runs on M2 Max MPS |
| Anthropic Haiku 4.5 verifier (daily + weekly) | ~$0.50/mo | ~5K verifications/mo |
| Grok 4.3 judge (production) | ~$0.61/mo | per Eval-Framework benchmark |
| Telegram Bot API | $0 | Free tier |
| launchd / git / Postgres (shared with Personal-RAG) | $0 | Already paid |
| Total | ~$5/mo | All 4 tiers + judge + auto-fix |
Per the Eval-Framework swap memo: this beats every premium-model-alone option (Sonnet alone $140/mo, Opus alone $285/mo, SaaS audit tools $500-2,000/mo).
9. Risks & open questions — outcomes
Original risks:
- LLM rephrases findings each run → fingerprint suppression breaks → resolved by switching to in-source explanatory comments (durable, no hash dependency)
- Mac off at 03:00 → cron misses → resolved via `RunAtLoad: true` + state-file dedupe (3-layer defense-in-depth)
- LLM sends raw secrets to Anthropic API → mitigated with 12-pattern PII redaction + 6 unit tests, 100% pass
- LLM-judge cost compounding → resolved via tier-specific model selection (Grok only on production verification, not bulk detection)
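The first risk in the list above, rephrasing that breaks fingerprint suppression, can be reproduced with any text-hash fingerprint. The hashing scheme and the two finding strings below are hypothetical; the document does not describe how the original fingerprints were computed.

```python
import hashlib


def fingerprint(finding_text: str) -> str:
    """Hash a whitespace- and case-normalized finding (assumed scheme)."""
    normalized = " ".join(finding_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


# The same contradiction, worded differently on two runs (invented strings).
run_1 = "CLAUDE.md says the A1 VM is running, but mail_watcher_project.md says it was never registered."
run_2 = "mail_watcher_project.md notes the A1 VM could not be registered, contradicting CLAUDE.md."

suppressed = {fingerprint(run_1)}
resurfaced = fingerprint(run_2) not in suppressed  # the rephrased finding is not suppressed
```

Normalizing case and whitespace does not help when the model reorders or rewords the sentence, which is why a comment stored next to the source text is more durable than a hash of the finding.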
Current risks:
- False negatives in monthly tier (Qwen 32B alone, no verifier) — accepted: deeper but slower, used for cross-workspace where speed isn’t critical
- Auto-fix mis-applies — mitigated by git snapshot + heuristics filter + judge verify + review-and-merge gate retained for high-risk types (URL, port, credential, infra topology)
- Drift in the judge itself (model deprecation) — Eval-Framework benchmark re-runs on every model swap
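The auto-fix mitigation above can be sketched as a routing function. The type names, the `FixCandidate` fields, and the order in which the gates are checked are assumptions for illustration; the document lists the gates but not how they are sequenced relative to the high-risk check.

```python
from dataclasses import dataclass

# High-risk fact types named in the risk list; the identifiers are illustrative.
HIGH_RISK_TYPES = {"url", "port", "credential", "infra_topology"}


@dataclass
class FixCandidate:
    fact_type: str
    passes_heuristics: bool
    judge_approved: bool


def route(candidate: FixCandidate) -> str:
    """Decide what happens to a proposed fix."""
    if not candidate.passes_heuristics or not candidate.judge_approved:
        return "reject"
    if candidate.fact_type in HIGH_RISK_TYPES:
        return "review"       # review-and-merge gate retained
    return "auto_apply"       # applied only after a git snapshot is taken
```

For example, a fix to a port number that passes both gates still returns `"review"`, while a low-risk fix with the same approvals returns `"auto_apply"`.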
Original open Qs:
- Q1: Run in cloud or local? → ✅ local — no cloud cost, low latency, runs alongside Personal-RAG
- Q2: Replace Haiku with cheaper model? → ✅ partially — Qwen 4B handles bulk, Haiku only verifies borderline
- Q3: How to handle multilingual corpus (VN 21% / mixed 28% / EN 40%)? → ✅ bge-m3 + Qwen dense (both multilingual SOTA)
10. Definition of Done
M1 Done: ✅ 2025-11-08 — 3 layers + weekly cron + Telegram digest live, 25+ real findings caught day-one.
Production-ready (M2) Done: ✅ 2026-05-24 — Grok 4.3 judge swapped (99% real accuracy), 4-tier cadence stable, auto-fix policy live, post-commit hook firing.
M3 (open):
- ⏳ Reverse audit (proactive bootstrap: scan codebase for facts that should be in memory)
- ⏳ Adversarial synthetic test corpus + precision/recall benchmark
- ⏳ Drift trend chart (sparkline of weekly RED count in digest header)
See also
- Architecture — 3-layer × 4-tier diagrams, auto-fix pipeline
- Implementation — code structure, prompts, judge prompt, perf numbers
- Notes — decision log + Grok 4.3 production swap retrospective
- Enterprise — 5 enterprise adaptations