Knowledge-Audit — PRD

Size M · P0 · Foundation · Status: ✅ Shipped (day-one deploy 2025-11-08, production judge swap 2026-05-23) · Build time: ~8 hours initial + ~5 hours Grok swap (one afternoon)

1. Problem

Persistent AI memory drifts. Sources I write today (memory files, workspace CLAUDE.md, project NOTES, meeting transcripts, KB notes) age into stale facts within weeks. The LLM that reads them treats every line as ground truth, has no notion of recency, and doesn’t know what it doesn’t know.

The triggering incident: the AI confidently asserted that I was running personal infrastructure on an Oracle Cloud A1 VM (ARM free tier), citing the workspace CLAUDE.md. The VM was never registered; it was a plan from February that I had filed under “OCI Foundation”. A separate memory file (mail_watcher_project.md) noted “couldn’t register A1”, but the AI never cross-referenced the two. I had been acting on a hallucination for ~6 months.

Pain: undetected contradictions in a long-lived KB poison every downstream decision. With 50+ memory files, 12 side projects, 4 workspace CLAUDE.md files, and ~100 meeting transcripts, drift is inevitable.

Why now: Personal-RAG shipped with 42k+ sources and growing. Daily auto-ingest means new contradictions enter the corpus every week. Without active validation, trust in the KB would collapse within a quarter.

2. Goal & Success Metrics

Goal: every Saturday morning, see a Telegram digest of contradictions and stale facts in my KB — severity-tagged, evidence-quoted, actionable in <2 minutes per finding.

Metrics — actual achieved:

Metric	Target	Achieved	Note
Findings caught in day-1 deploy	≥10	25+	79-file corpus
Real accuracy (LLM-judged on holdout)	≥90%	99%	After Grok 4.3 swap, Eval-Framework-verified
Strict-substring accuracy	n/a	80%	Score before LLM judging; the bake-off showed that strict matching marks alternate valid findings as failures
Cost / month	<$10	~$5	All 4 tiers combined
Audit cadence	weekly	4 tiers (daily / weekly / monthly / event)	Plus post-commit hook
Time-to-triage 1 finding	<5 min	<2 min	ADHD digest format
False positive rate	<20%	~12%	Measured on 50-finding sample
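
The gap between the strict-substring row and the LLM-judged row can be illustrated with a small sketch. Everything here is hypothetical: the sample cases are invented for illustration, and the third tuple element stands in for a judge verdict rather than calling a real LLM. It only shows why verbatim matching undercounts findings that are correct but worded differently.

```python
def strict_substring_hit(expected: str, finding: str) -> bool:
    """Pass only if the expected phrase appears verbatim (case-insensitive)."""
    return expected.lower() in finding.lower()


def accuracy(hits: list) -> float:
    """Fraction of cases counted as correct."""
    return sum(hits) / len(hits)


# Hypothetical holdout items: (expected phrase, model finding, judge verdict).
# The verdict column is a stand-in for an LLM judge, not real eval data.
cases = [
    ("A1 VM was never registered", "The A1 VM was never registered.", True),
    ("A1 VM was never registered", "No Oracle A1 instance exists; registration failed.", True),
    ("port 8080", "Service listens on port 8080 per NOTES.", True),
    ("port 8080", "The weather is nice.", False),
]

strict = accuracy([strict_substring_hit(exp, found) for exp, found, _ in cases])
judged = accuracy([verdict for _, _, verdict in cases])
print(f"strict={strict:.2f} judged={judged:.2f}")
```

The second case is the trap: the finding is valid but shares no verbatim phrase with the expected answer, so strict scoring counts it as a miss.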

3. User journey

A sync agent (Personal-RAG mount-watcher, meeting transcriber, AI session hook) writes a .md file into the KB mount.
Daily at 03:01 local: light audit (4B local model + Haiku 4.5 verifier) over files touched in the last 24h.
Weekly on Saturday at 21:00: full audit (8B + Haiku 4.5 verifier) across the entire personal corpus.
Monthly on the 1st at 22:00: deeper audit (32B alone, longer context windows) including cross-workspace contradictions.
Event-triggered: a post-commit hook on the KB-s3 mount fires a scoped audit on the changed files within minutes.
Findings → Grok 4.3 judge verification → severity tag 🟢🟡🔴 → Telegram digest, bundled 2× per day.
For weekly/monthly: auto-fix candidates pass safety heuristics + Grok 4.3 judge → git snapshot → apply (review-and-merge gate retained for high-risk types).
User triages in the Telegram thread: ✅ accept / ⏸ snooze / ❌ reject.
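
The last steps of the journey (severity tag, evidence quote, triage options) could be rendered roughly as follows. This is a sketch under assumptions: the `Finding` fields, the severity names, and the digest layout are hypothetical, and the actual Telegram delivery call is left out.

```python
from dataclasses import dataclass

# Assumed mapping from severity names to the emoji tags used in the digest.
SEVERITY_EMOJI = {"red": "🔴", "yellow": "🟡", "green": "🟢"}
SEVERITY_ORDER = {"red": 0, "yellow": 1, "green": 2}


@dataclass(frozen=True)
class Finding:
    severity: str   # "red", "yellow", or "green"
    summary: str    # one-line description of the contradiction
    evidence: str   # verbatim quote from the source
    source: str     # file the quote came from


def format_digest(findings) -> str:
    """Render findings as plain text, most severe first."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    lines = [f"Knowledge-Audit digest: {len(ordered)} finding(s)"]
    for i, f in enumerate(ordered, start=1):
        lines.append(f"{i}. {SEVERITY_EMOJI[f.severity]} {f.summary}")
        lines.append(f'   "{f.evidence}" ({f.source})')
        lines.append("   ✅ accept / ⏸ snooze / ❌ reject")
    return "\n".join(lines)


digest = format_digest([
    Finding("yellow", "NOTES may be stale", "deploy planned for March", "NOTES.md"),
    Finding("red", "A1 VM claimed but never registered",
            "couldn’t register A1", "mail_watcher_project.md"),
])
print(digest)
```

Sorting by severity keeps the most urgent item at the top of the message, which suits a short triage window per finding.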

4. Scope (MoSCoW) — final

Must — DONE:

✅ 3-layer audit (intra-file / cross-file in workspace / cross-workspace)
✅ 4-tier cron via launchd (daily / weekly / monthly / event-triggered)
✅ Cross-source contradiction detection with evidence-chain output (verbatim quotes)
✅ Severity tagging (🟢🟡🔴)
✅ Telegram delivery with ADHD-friendly formatting
✅ PII redaction before any LLM call (~12 regex patterns + email partial-redact)
✅ Production LLM judge (Grok 4.3, swapped from Haiku 4.5 after eval)
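
As an illustration of the redaction step, here is a minimal sketch of regex-based redaction with partial email masking. The patterns and placeholder strings below are assumptions chosen for the example; they are not the ~12 patterns used in the real pipeline.

```python
import re

# Hypothetical secret patterns; the production list is not reproduced here.
SECRET_PATTERNS = [
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[REDACTED_IP]"),
]

# Capture the first character of the local part and the full domain so the
# email stays recognizable without exposing the whole address.
EMAIL_RE = re.compile(
    r"\b([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b"
)


def redact(text: str) -> str:
    """Replace secrets with placeholders, then partially mask emails."""
    for pattern, placeholder in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return EMAIL_RE.sub(lambda m: f"{m.group(1)}***@{m.group(2)}", text)


sample = "Contact alice@example.com from 10.0.0.5 with key AKIAABCDEFGHIJKLMNOP"
print(redact(sample))
```

Running redaction before any prompt is built means a missed pattern is the only way a secret can reach an external API, which is why each pattern is worth a unit test.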

Should — DONE:

✅ Re-uses Personal-RAG retrieval (bge-m3 + Postgres) — no duplicate index
✅ Catch-up logic (launchd RunAtLoad: true + skip-if-recent state file) for Mac off/asleep
✅ Fingerprint-based suppression with expiry (deprecated mid-2026 in favor of in-source explanatory comments — see Notes)
✅ Auto-fix policy (Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply) for weekly/monthly
✅ Manual probe registry (~23 entries for project-specific facts)
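
The skip-if-recent half of the catch-up logic can be sketched as below. This assumes a JSON state file keyed by tier name and a per-tier minimum interval; the file name, the function names, and the 20-hour interval are hypothetical. With `RunAtLoad` the job fires whenever the Mac wakes or boots, so the state file is what prevents duplicate runs.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


def _load_state(state_path: Path) -> dict:
    """Read the state file, treating a missing or corrupt file as empty."""
    try:
        return json.loads(state_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def should_run(state_path: Path, tier: str, min_interval_s: float,
               now: Optional[float] = None) -> bool:
    """True if the tier has never run or its last run is old enough."""
    now = time.time() if now is None else now
    last_run = _load_state(state_path).get(tier)
    return last_run is None or now - last_run >= min_interval_s


def mark_ran(state_path: Path, tier: str, now: Optional[float] = None) -> None:
    """Record a completed run, preserving the other tiers' entries."""
    now = time.time() if now is None else now
    state = _load_state(state_path)
    state[tier] = now
    state_path.write_text(json.dumps(state))


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "audit_state.json"   # hypothetical file name
    interval = 20 * 3600                          # assumed daily-tier interval
    first = should_run(state_file, "daily", interval, now=1000.0)
    mark_ran(state_file, "daily", now=1000.0)
    second = should_run(state_file, "daily", interval, now=2000.0)
    third = should_run(state_file, "daily", interval, now=1000.0 + 21 * 3600)
```

Here `first` and `third` come out true and `second` false: a run is allowed when no state exists or enough time has passed, and skipped otherwise.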

Could — partial:

⏸️ Reverse audit (scan codebase for facts that should be in memory)
⏸️ Adversarial synthetic test corpus with known answers (precision/recall benchmark)
✅ Drift trend dashboard — minimal version in Telegram digest (week-over-week RED count)

Won’t (M1) — kept:

Multi-user / team mode
Web UI — Telegram is sufficient
Real-time per-keystroke validation — overkill

5. Architecture (final)

3 detection layers × 4 trigger tiers × 1 judge layer × 1 auto-fix pipeline. See Architecture for diagrams.
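
To make the three detection layers concrete, the sketch below classifies a pair of documents by layer. It assumes each document carries a workspace label, and it enumerates all pairs exhaustively for clarity; the real system re-uses Personal-RAG retrieval to pick candidates, so treat `Doc`, `layer_of`, and `comparisons` as hypothetical names.

```python
from itertools import combinations
from typing import NamedTuple


class Doc(NamedTuple):
    workspace: str
    path: str


def layer_of(a: Doc, b: Doc) -> str:
    """Name the detection layer that a comparison of a and b falls under."""
    if a == b:
        return "intra-file"
    if a.workspace == b.workspace:
        return "cross-file"
    return "cross-workspace"


def comparisons(docs):
    """Group every self-check and every pair of documents by layer."""
    grouped = {"intra-file": [], "cross-file": [], "cross-workspace": []}
    for doc in docs:
        grouped["intra-file"].append((doc, doc))
    for a, b in combinations(docs, 2):
        grouped[layer_of(a, b)].append((a, b))
    return grouped


docs = [
    Doc("infra", "CLAUDE.md"),
    Doc("infra", "NOTES.md"),
    Doc("mail", "mail_watcher_project.md"),
]
counts = {layer: len(pairs) for layer, pairs in comparisons(docs).items()}
print(counts)
```

In the triggering incident, the contradicting files may have lived in different workspaces, which is the case the cross-workspace layer is meant to cover.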

6. Tech Stack — final choices

Layer	Original spec	Implemented	Reason for change
Detection model (light)	Haiku 4.5	Qwen 4B local + Haiku 4.5 verifier	Local model handles bulk, Haiku verifies only borderline cases — 4× cheaper
Detection model (heavy)	Sonnet 4.6	Qwen 8B local + Haiku 4.5 verifier (weekly), Qwen 32B alone (monthly)	Same logic, scaled by tier
Production judge	Haiku 4.5	Grok 4.3 (xAI)	Eval-Framework bake-off 2026-05-23: Haiku 13% accuracy on audit task, Grok 99% verified on holdout-99
Scoring	Strict substring	LLM-judge (Sonnet 4.6) for production verification	Strict substring counted alternate valid findings as fail (80% → 99% true accuracy when LLM-judged)
Retrieval	n/a — read files raw	Re-use Personal-RAG (bge-m3 + pgvector)	Avoid duplicate index, get multilingual retrieval for free
Trigger	systemd cron	launchd (`RunAtLoad: true` + skip-if-recent state file)	Mac off/asleep catch-up
Delivery	email	Telegram Bot API	ADHD-friendly inline triage (✅/⏸/❌)
Auto-fix gate	none	Safety heuristics + Grok 4.3 judge + git snapshot	Reversibility for “apply automatically” tier
Runtime host	Cloud VM	Local MacBook Pro M2 Max	$0 cloud cost, low latency, runs alongside Personal-RAG

Cost posture: ~$5/month at current cadence (daily + weekly + monthly + event-driven). The Grok 4.3 judge is the largest itemized API line at ~$0.61/mo per the Eval-Framework benchmark.

7. Milestones — actual

Date	What shipped
2025-11-08	Initial deploy — 3 layers, weekly cron, Telegram digest. Day-one corpus: 79 files, 25+ findings
2025-11-09	Hardening pass — PII redaction (12 regex patterns), launchd catch-up logic, fingerprint suppression
2025-11-15	4-tier cron (daily / weekly / monthly / event) shipped
2026-02-x	Post-commit hook on KB-s3 mount (event-triggered tier)
2026-05-20	Audit corpus profile measured: VN 21% / mixed 28% / EN 40% → confirms Qwen dense primary, 4B+Haiku sweet spot
2026-05-23	Production judge swap Haiku → Grok 4.3. Measured accuracy on Eval-Framework holdout-99: 80% under strict-substring scoring, 99% when LLM-judged. Ship cost $0.61/mo
2026-05-24	Auto-fix policy live: Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply. Telegram ADHD format finalized

8. Cost & Quota

Item	Cost	Note
Qwen 4B / 8B / 32B local inference	$0	Runs on M2 Max MPS
Anthropic Haiku 4.5 verifier (daily + weekly)	~$0.50/mo	~5K verifications/mo
Grok 4.3 judge (production)	~$0.61/mo	per Eval-Framework benchmark
Telegram Bot API	$0	Free tier
launchd / git / Postgres (shared with Personal-RAG)	$0	Already paid
Total	~$5/mo	All 4 tiers + judge + auto-fix

Per the Eval-Framework swap memo: this beats every premium-model-alone option (Sonnet alone $140/mo, Opus alone $285/mo, SaaS audit tools $500-2,000/mo).
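
The comparison above reduces to simple arithmetic, worked through here using only figures quoted in this section. The variable names are illustrative, and the SaaS entry uses the low end of the quoted range.

```python
# Figures taken from the cost table and the swap memo quoted above.
haiku_monthly_usd = 0.50
haiku_verifications_per_month = 5_000
total_monthly_usd = 5.0  # the "~$5/mo" total

# Unit cost of one Haiku verification.
per_verification_usd = haiku_monthly_usd / haiku_verifications_per_month

alternatives_usd = {
    "Sonnet alone": 140.0,
    "Opus alone": 285.0,
    "SaaS audit tools (low end)": 500.0,
}

# How many times more expensive each alternative is than the current setup.
ratios = {name: cost / total_monthly_usd for name, cost in alternatives_usd.items()}

print(f"per verification: ${per_verification_usd:.4f}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x the current monthly cost")
```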

9. Risks & open questions — outcomes

Original risks:

LLM rephrases findings each run → fingerprint suppression breaks → resolved by switching to in-source explanatory comments (durable, no hash dependency)
Mac off at 03:00 → cron misses → resolved via RunAtLoad: true + state-file dedupe (3-layer defense-in-depth)
LLM sends raw secrets to Anthropic API → mitigated with 12-pattern PII redaction (6 unit tests, all passing)
LLM-judge cost compounding → resolved via tier-specific model selection (Grok only on production verification, not bulk detection)
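
The first risk above is easy to reproduce. The sketch below assumes the fingerprint was a hash of the normalized finding text, which is an assumption since the original scheme is not described here. Normalization absorbs whitespace and case changes but not a rephrase, so the same underlying contradiction escapes suppression on the next run.

```python
import hashlib


def text_fingerprint(finding_text: str) -> str:
    """Hash the finding after collapsing whitespace and lowercasing."""
    normalized = " ".join(finding_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


# Two hypothetical phrasings of the same contradiction from different runs.
run1 = ("CLAUDE.md says the A1 VM is running, but "
        "mail_watcher_project.md says it was never registered.")
run2 = ("mail_watcher_project.md notes the A1 VM could not be registered, "
        "contradicting CLAUDE.md.")

suppressed = {text_fingerprint(run1)}
still_suppressed = text_fingerprint(run2) in suppressed
print(still_suppressed)
```

An explanatory comment placed in the source file does not depend on how the model words its finding, which is why it survives rephrasing where a text hash does not.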

Current risks:

False negatives in monthly tier (Qwen 32B alone, no verifier) — accepted: deeper but slower, used for cross-workspace where speed isn’t critical
Auto-fix mis-applies — mitigated by git snapshot + heuristics filter + judge verify + review-and-merge gate retained for high-risk types (URL, port, credential, infra topology)
Drift in the judge itself (model deprecation): mitigated by re-running the Eval-Framework benchmark on every model swap
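
The auto-fix mitigation can be read as a small routing decision. The sketch below is one possible shape, assuming each proposed fix carries a type label and two boolean check results; the type names and return values are hypothetical, and the git snapshot step is only noted in a comment.

```python
# Fix types that keep the review-and-merge gate, per the risk list above.
HIGH_RISK_TYPES = {"url", "port", "credential", "infra_topology"}


def route_fix(fix_type: str, passes_heuristics: bool, judge_approves: bool) -> str:
    """Decide what happens to a proposed fix.

    Returns "reject", "review", or "auto_apply". In a real pipeline a git
    snapshot would be taken before any "auto_apply" so it can be reverted.
    """
    if not passes_heuristics or not judge_approves:
        return "reject"
    if fix_type in HIGH_RISK_TYPES:
        return "review"
    return "auto_apply"


decisions = {
    "typo": route_fix("typo", True, True),
    "port": route_fix("port", True, True),
    "failed judge": route_fix("typo", True, False),
}
print(decisions)
```

Both checks must pass before risk type is even considered, so a high-risk fix that fails either check is rejected rather than queued for review.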

Original open Qs:

Q1: Run in cloud or local? → ✅ local — no cloud cost, low latency, runs alongside Personal-RAG
Q2: Replace Haiku with cheaper model? → ✅ partially — Qwen 4B handles bulk, Haiku only verifies borderline
Q3: How to handle multilingual corpus (VN 21% / mixed 28% / EN 40%)? → ✅ bge-m3 + Qwen dense (both chosen for their multilingual support)

10. Definition of Done

M1 Done: ✅ 2025-11-08 — 3 layers + weekly cron + Telegram digest live, 25+ real findings caught day-one.

Production-ready (M2) Done: ✅ 2026-05-24, with the judge swapped to Grok 4.3 (99% real accuracy), 4-tier cadence stable, auto-fix policy live, post-commit hook firing.

M3 (open):

⏳ Reverse audit (proactive bootstrap: scan codebase for facts that should be in memory)
⏳ Adversarial synthetic test corpus + precision/recall benchmark
⏳ Drift trend chart (sparkline of weekly RED count in digest header)
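
For the open drift-trend item, a text sparkline is one lightweight option that fits in a Telegram digest header. The sketch below is illustrative only: the weekly counts are made up and the header layout is an assumption.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(counts) -> str:
    """Map each count to one of eight bar glyphs, scaled to the series range."""
    if not counts:
        return ""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return BARS[0] * len(counts)
    span = hi - lo
    return "".join(BARS[(c - lo) * (len(BARS) - 1) // span] for c in counts)


# Hypothetical weekly RED counts, oldest first.
weekly_red = [7, 5, 5, 3, 4, 1]
trend = sparkline(weekly_red)
delta = weekly_red[-1] - weekly_red[-2]
header = f"RED trend {trend} (week-over-week {delta:+d})"
print(header)
```

Scaling to the series range rather than to zero makes small week-to-week changes visible, at the cost of hiding the absolute level, so the numeric delta is shown alongside.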
