Notes & Decision Log
Format: YYYY-MM-DD — context — decision/finding.
Decisions
- 2026-05-24 — Auto-fix policy live. Weekly/monthly tiers now route verified findings through: Grok 4.3 propose → safety heuristics filter (≤5 lines, no high-risk path, scoped to 1 file, no rm/DROP) → Grok 4.3 judge re-verify →
git stash --include-untrackedsnapshot → apply. Telegram digest shows the revert SHA for one-click rollback. Manual review-and-merge gate retained for high-risk types (URL/port/credential/infra topology). ADHD format finalized: severity emoji first, action verb opens, time-box estimate, bundled 2×/day. - 2026-05-23 — PRODUCTION SWAP: Haiku 4.5 → Grok 4.3 as judge. After Eval-Framework bake-off on a 99-finding holdout. Strict-substring scoring said Haiku 13% / Sonnet 60% / Opus 67% / Grok 80%. LLM-judged real-accuracy (the right metric) said Grok 99%. Strict scorer counted alternate valid findings as failures — the “Eval Scorer Strict Trap”. Production cost at personal scale ~$0.61/mo; at SMB scale (15K audits/mo) ~$90/mo vs SaaS audit tools $500–2,000/mo. Shipped same afternoon.
- 2026-05-20 — Audit corpus profile measured: VN 21% / mixed 28% / EN 40% → effective ~50/50 bilingual. Confirms Qwen dense primary (bge-m3 retrieval + Qwen detector both multilingual SOTA). Code-mixed 28% = the hardest bucket; flagged as a dedicated eval subtask for Eval-Framework. 4B + Haiku verifier remains the daily sweet spot (80% accuracy, 3s, 2.5 GB RAM).
- 2026-02-x — Event-triggered tier added. Post-commit hook on the KB-s3 mount (Personal-RAG canonical write target) fires a scoped audit on the changed files within minutes. Catches drift introduced by manual file edits before it propagates into AI assertion chains.
- 2025-11-15 — 4-tier cron complete. Initial weekly cadence extended to daily (light, 4B+Haiku) + weekly (8B+Haiku) + monthly (32B alone, Layer 3 cross-workspace). Tier-specific models because: most findings are caught by smaller scope at higher frequency; cross-workspace contradictions are slower-evolving and tolerate monthly.
- 2025-11-09 — Hardening pass. PII redaction (12 regex patterns + email partial-redact, 6 unit tests, 100% pass). launchd
RunAtLoad: true+ skip-if-recent state file for Mac off/sleep catch-up. Fingerprint-based suppression added with expiry (later deprecated — see gotcha below). - 2025-11-08 — Day-one deploy. 3 layers (intra-file static / cross-file LLM / cross-workspace LLM) + weekly cron + Telegram digest. Initial corpus: 79 files. First run surfaced 25+ real findings including the Oracle A1 hallucination, the rag-kb hardware drift (M1 16 GB → M2 Max 64 GB), workspace count inconsistency (6 vs 4), Inko port mismatch (8081 vs 9090), Python version drift on health-coach (3.10 vs 3.11). All 25+ confirmed real on manual review.
- 2025-11-08 (early) — Build vs Buy decision: build. SaaS audit tools (Promptfoo + Inspect AI bundle, $500–2,000/mo) couldn’t be considered because: (a) KB contains internal product docs + customer data + IP (data residency / VN DPD / NDA), (b) heterogeneous source formats (memory + CLAUDE.md + project NOTES + meeting transcripts + Confluence dump) — no SaaS tool parses all of these, (c) cost-quality frontier favors build at SMB scale (~$90/mo vs $500-2K), payback 1–2 months, (d) prompt iteration in one afternoon vs vendor roadmap dependency.
- 2025-11-08 (early) — Re-use Personal-RAG retrieval, don’t build own index. bge-m3 + Postgres + pgvector already index the corpus; duplicating would waste 3.2 GB of disk and the bilingual SOTA retrieval quality already measured (Hit@3 = 97.8%). Audit reads through Personal-RAG for cross-source lookup.
Gotchas
- 2026-05-23 — Eval Scorer Strict Trap. Strict-substring scoring on multi-valid-output tasks counts alternate valid findings as failures. Grok 4.3 was real-99% but strict-80% — easy to under-pick a backend if you tune-before-LLM-judge. Cross-ref: same lesson hit on Dojo eval the same week. Verify with LLM-judge BEFORE tuning prompt/model.
- 2026-01-x — Fingerprint suppression breaks. First version stored
(file, finding_type, evidence_hash)fingerprints insuppressions.jsonwith expiry. Added 14 suppressions for legitimate-but-flagged findings. Next run: 7 NEW findings, all logically equivalent to suppressed ones but with different fingerprints — the LLM rephrases findings on every run (slightly different wording, different example quote, different hash). Was playing whack-a-mole. Fix: deprecated fingerprint suppression. Instead, add an explanatory<!-- NOTE: ... -->comment in the source file itself. The LLM reads the file, sees the explanation, stops flagging. Durable, no hash dependency. Deeper lesson: bad suppression hides the problem; good context teaches the auditor. - 2025-11-x — macOS
crondoesn’t catch up missed runs. If the Mac is asleep at 03:00, cron never fires. Fix: use launchd withRunAtLoad: true+ state-file skip-if-recent guard. On wake/boot the audit fires, state file prevents duplicate runs in the same window. Defense-in-depth: three independent catch-up layers. - 2025-11-x — Hooks running binaries inside
~/Documents/hang silently until Full Disk Access is granted per-binary in System Settings. Fix: grant FDA to/bin/bashand the venv python, or keep scripts outside~/Documents/. - 2025-11-08 — Default cron was sending findings on first run before redaction was wired up. Caught immediately on review; rotated the leaked API key that was in one of the sources. Mandatory: redaction unit test in CI before any deploy that touches LLM calls.
- 2025-11-08 — MLX OOM on Qwen 32B with full corpus. First monthly run blew up with a Metal “Impacting Interactivity” crash. Fix: bumped
--max-seqdown to 4096 for 32B tier + gradient-checkpoint mode + only run when no other heavy app is open. Cross-ref the broadermlx-lm LoRA Metal crashlesson — same root cause (competing with Personal-RAG daemon).
Production-swap retrospective: Haiku 4.5 → Grok 4.3
The single most impactful decision in the project’s life. Sequence:
- Shipped with Haiku 4.5 as judge (cheap, fast, familiar). Strict-substring scorer reported 13% accuracy on the audit task. Assumed “audit must be hard for small models”.
- Built Eval-Framework for unrelated reasons.
- Ran knowledge-audit task through Eval-Framework’s bake-off with LLM-judge scoring → real accuracy numbers were very different from strict-substring.
- Grok 4.3 emerged as the dominant pick: 99% real accuracy (vs Haiku 13%), at 1/3 the cost of Sonnet alone, beating Opus by 30+ points.
- Same afternoon: swap in production, re-run last 4 weekly findings against new judge for backfill, confirm no regression.
Cost of the swap: ~5 hours. Value: confidence in every future audit output went from “I should manually verify ~half” to “I trust the surfaced findings”.
Lesson encoded as Marc-rule: production LLM judge default = Grok 4.3 per dojo eval, NOT Haiku. Defaulting to Haiku because of SDK familiarity is a real bias, has happened twice.
Reference links
- The triggering blog post: Why your AI memory lies to you
- The production-swap blog post: Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3
- Personal-RAG (substrate): /projects/personal-rag-kb
- Eval-Framework (verified the swap): /projects/eval-framework
- launchd
RunAtLoad: https://developer.apple.com/documentation/xpc/xpc_services_xpc_launchd_plist - Anthropic Claude API: https://docs.anthropic.com/
- xAI Grok API: https://docs.x.ai/
Working-session log
| Date | Hours | What | Outcome |
|---|---|---|---|
| 2025-11-08 | ~8 h | Initial build (3 layers + weekly cron + Telegram + PII redaction) | Day-one deploy, 25+ findings |
| 2025-11-09 | ~3 h | Hardening (launchd catch-up, fingerprint suppression, more redactions) | Stable weekly cadence |
| 2025-11-15 | ~2 h | 4-tier cron split (daily/weekly/monthly + event) | Multi-cadence shipped |
| 2026-01-x | ~1 h | Deprecate fingerprint suppression, switch to in-source comments | More durable, lower maintenance |
| 2026-02-x | ~2 h | Post-commit hook on KB-s3 mount (event tier) | Sub-minute drift detection |
| 2026-05-20 | ~1 h | Corpus profile measurement | Confirmed Qwen + bge-m3 stack |
| 2026-05-23 | ~5 h | Grok 4.3 production swap (Eval-Framework bake-off + ship same day) | 80% → 99% real accuracy |
| 2026-05-24 | ~3 h | Auto-fix policy + safety heuristics + ADHD format finalize | Weekly/monthly hands-off mode |
| Total | ~25 h | — | Production-ready, $5/mo |