Implementation
Sister docs: PRD (intent), Architecture (system view), Notes (decision log).
TL;DR
A production cross-source audit in continuous personal use since 2025-11-08:
- 3 detection layers (intra-file static / cross-file LLM / cross-workspace LLM)
- 4 launchd tiers (daily / weekly / monthly / event-triggered)
- Production judge: Grok 4.3 (swapped from Haiku 4.5 on 2026-05-23) — real accuracy 80% → 99% after Eval-Framework bake-off
- Local-first: Qwen 4B/8B/32B on the M2 Max GPU (Metal, via mlx-lm), Anthropic Haiku 4.5 as verifier, Grok 4.3 as judge
- Auto-fix policy: Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply
- Cost ~$5/month all tiers combined
- Output: Telegram digest, ADHD-friendly, bundled 2×/day, severity 🟢🟡🔴 + action verb + time-box
Stack
| Layer | Component | Version / Notes |
|---|---|---|
| Compute | MacBook Pro M2 Max | local; shares host with Personal-RAG |
| OS | macOS | launchd-managed jobs (ai.knowledge-audit.*) |
| Runtime | Python | 3.11 + venv |
| Local LLM | mlx-lm running Qwen 4B / 8B / 32B Q4 | Metal-accelerated (MLX); tier-specific size |
| Verifier (borderline) | Anthropic Haiku 4.5 | claude-haiku-4-5 |
| Production judge | xAI Grok 4.3 | grok-4-3, verified 99% on Eval-Framework holdout-99 |
| Retrieval | Personal-RAG (bge-m3 + Postgres 16 + pgvector) | re-used; no duplicate index |
| Delivery | Telegram Bot API | inline triage buttons |
| Snapshot | git | git stash --include-untracked before apply |
| Scheduler | launchd | 4 plists: daily, weekly, monthly, event-watcher |
Directory layout
~/.claude/hooks/audit-knowledge/
├── audit.py # main driver
├── tiers.py # tier config (daily/weekly/monthly/event)
├── layer1_static.py # intra-file mechanical checks
├── layer2_cross_file.py # cross-file LLM scan
├── layer3_cross_workspace.py # cross-workspace LLM scan
├── judge.py # Grok 4.3 judge
├── autofix/
│ ├── propose.py # Grok patch generator
│ ├── safety.py # heuristics filter
│ └── apply.py # git snapshot + apply
├── delivery/
│ ├── telegram.py # digest formatter + sender
│ └── format_adhd.py # severity tag + action verb + time-box
├── redact.py # PII regex patterns (12 patterns + email partial)
├── prompts/
│ ├── detector_l2.txt
│ ├── detector_l3.txt
│ ├── judge.txt
│ └── propose_fix.txt
├── state/
│ ├── last_run.json # skip-if-recent dedupe
│ ├── suppressions.json # legacy fingerprint suppression (deprecated)
│ └── digest_buffer.json # bundling state for 2×/day delivery
└── tests/
  └── test_redact.py # 6 unit cases, 100% pass
~/Library/LaunchAgents/
├── ai.knowledge-audit.daily.plist
├── ai.knowledge-audit.weekly.plist
├── ai.knowledge-audit.monthly.plist
└── ai.knowledge-audit.event-watcher.plist
Layer 1 — static checks
Mechanical claim extractor with per-type verifier:
| Claim type | Detector | Verifier |
|---|---|---|
| path | regex `(?P<path>(?:~\|/)[\w./-]+.\w+)` | |
| port | regex \bport\s+(\d+)\b | 1 <= int <= 65535 |
| ip | regex IPv4 | socket.inet_aton |
| url | regex https?://[^\s)]+ | format-only in Layer 1; HTTP probing is handled by the auto-probe manual layer |
| version | regex Python\s+3\.\d+ etc. | product-specific (python3 --version etc.) |
Cost: $0. Catches mechanical drift the LLM doesn’t need to think about.
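A minimal sketch of the extract-then-verify loop for two of the claim types, assuming the port regex and the socket.inet_aton check from the table. The IPv4 regex and the names check_text, verify_port, and verify_ip are illustrative and not taken from layer1_static.py.

```python
import re
import socket

# PORT_RE copies the table's pattern; IPV4_RE is an assumed IPv4 matcher.
PORT_RE = re.compile(r"\bport\s+(\d+)\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def verify_port(raw: str) -> bool:
    """A port claim is valid when it is an integer in 1..65535."""
    return 1 <= int(raw) <= 65535


def verify_ip(raw: str) -> bool:
    """Delegate IPv4 validity to socket.inet_aton, as the table does."""
    try:
        socket.inet_aton(raw)
        return True
    except OSError:
        return False


def check_text(text: str) -> list[dict]:
    """Extract port and IPv4 claims and attach a pass/fail flag to each."""
    findings = []
    for match in PORT_RE.finditer(text):
        value = match.group(1)
        findings.append({"type": "port", "value": value, "ok": verify_port(value)})
    for match in IPV4_RE.finditer(text):
        value = match.group(0)
        findings.append({"type": "ip", "value": value, "ok": verify_ip(value)})
    return findings
```

Calling check_text on a note that mentions "port 70000" yields a port finding with ok set to False, with no LLM call involved.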
Layer 2 — cross-file LLM scan (within workspace)
Bundle all files in scope into one prompt, ask the model to find contradictions:
## CATEGORY: memory
### file_a.md
<content>
### file_b.md
<content>
## CATEGORY: workspace_claude_md
### CLAUDE.md
<content>
## CATEGORY: project_notes
### project_X/NOTES.md
<content>
Find facts that contradict each other across files, or that look stale.
Output JSON array with: { file, level, type, detail, evidence_chain, confidence }.
Rules:
- "level" ∈ {RED, YELLOW}
- RED: directly contradicts another source with verbatim quote
- YELLOW: looks dated/unverified
- "type" ∈ {cross_source_mismatch, stale_fact, broken_assumption}
- "evidence_chain" MUST include verbatim quotes (primary_quote + conflicting_quote)
- "confidence" 0-100
The detector returns raw findings. The judge layer (Grok 4.3) then verifies each one is real before surfacing.
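The bundling and the output rules can be sketched as below. This assumes evidence_chain arrives as an object with primary_quote and conflicting_quote keys, which the prompt implies but does not spell out. build_bundle and parse_findings are hypothetical names, not the functions in layer2_cross_file.py.

```python
import json

ALLOWED_LEVELS = {"RED", "YELLOW"}
ALLOWED_TYPES = {"cross_source_mismatch", "stale_fact", "broken_assumption"}


def build_bundle(files_by_category: dict[str, dict[str, str]]) -> str:
    """Render {category: {filename: content}} in the CATEGORY / file layout."""
    parts = []
    for category, files in files_by_category.items():
        parts.append(f"## CATEGORY: {category}")
        for name, content in files.items():
            parts.append(f"### {name}\n{content}")
    return "\n".join(parts)


def parse_findings(raw_json: str) -> list[dict]:
    """Keep only findings that satisfy the rules stated in the prompt."""
    valid = []
    for item in json.loads(raw_json):
        chain = item.get("evidence_chain")
        confidence = item.get("confidence")
        if not isinstance(chain, dict):
            continue  # assumed shape: {"primary_quote": ..., "conflicting_quote": ...}
        if item.get("level") not in ALLOWED_LEVELS:
            continue
        if item.get("type") not in ALLOWED_TYPES:
            continue
        if not (chain.get("primary_quote") and chain.get("conflicting_quote")):
            continue
        if not (isinstance(confidence, (int, float)) and 0 <= confidence <= 100):
            continue
        valid.append(item)
    return valid
```

Dropping malformed findings before the judge call keeps judge spend proportional to findings that at least carry both quotes.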
Layer 3 — cross-workspace (monthly only)
Same prompt shape, wider context: _personal + _shared + workspace CLAUDE.md bundled. Uses Qwen 32B alone (no Haiku verifier — judge handles it). Longest context window of all tiers.
This is the layer that caught the Oracle A1 hallucination (workspace CLAUDE.md claimed A1 live, personal memory said “couldn’t register”).
The judge prompt (production)
You are verifying whether a flagged finding is a REAL contradiction
or a false positive.
FINDING:
{finding_json}
PRIMARY SOURCE EXCERPT:
{primary_excerpt}
CONFLICTING SOURCE EXCERPT:
{conflicting_excerpt}
Decide:
1. Is the contradiction real (both sources cited are actually present,
and they actually disagree)?
2. Is the disagreement on the SAME fact (not different things that
look similar)?
3. Is the more-recent source contradicting the older one
(genuine drift, not intentional version note)?
Respond with JSON:
{
"verdict": "real" | "false_positive" | "ambiguous",
"reason": "<one sentence>",
"verified_evidence": <verbatim quote that proves the contradiction, or null>
}
This is the prompt that took accuracy from 80% (strict-match scorer on Haiku) to 99% (LLM-judged real with Grok 4.3) on the Eval-Framework holdout-99 set.
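One way to consume the judge's reply defensively is sketched below. Downgrading a "real" verdict to "ambiguous" when the quoted evidence is not found verbatim in either excerpt is an assumption of this sketch, not a documented behaviour of judge.py. parse_judge_response is a hypothetical helper.

```python
import json

VERDICTS = {"real", "false_positive", "ambiguous"}


def _ambiguous(reason: str) -> dict:
    """Fallback result used whenever the reply cannot be trusted."""
    return {"verdict": "ambiguous", "reason": reason, "verified_evidence": None}


def parse_judge_response(raw: str, primary_excerpt: str, conflicting_excerpt: str) -> dict:
    """Validate the judge's JSON and check that the cited evidence really exists."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _ambiguous("unparseable judge output")
    if not isinstance(data, dict) or data.get("verdict") not in VERDICTS:
        return _ambiguous("unknown verdict")
    evidence = data.get("verified_evidence")
    if data["verdict"] == "real":
        # Assumed policy: a "real" verdict must quote one of the excerpts verbatim.
        found = isinstance(evidence, str) and evidence.strip() != "" and (
            evidence in primary_excerpt or evidence in conflicting_excerpt
        )
        if not found:
            return _ambiguous("evidence quote not found in either excerpt")
    return {
        "verdict": data["verdict"],
        "reason": data.get("reason", ""),
        "verified_evidence": evidence,
    }
```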
Production swap: Haiku 4.5 → Grok 4.3 (2026-05-23)
The Eval-Framework bake-off measured 5 backends on the same 99-finding holdout. Strict-substring scoring was misleading — it counted alternate valid findings as failures (the “Eval Scorer Strict Trap”). LLM-judged real-accuracy was the right metric:
| Backend | Strict | LLM-judged | Cost @ 15K audits/mo |
|---|---|---|---|
| Haiku 4.5 alone (as judge) | 13% | 13% | $60 |
| Grok 4.3 + Sonnet 4.6 judge (shipped) | 80% | 99% | $90 (SMB scale) / ~$0.61/mo (personal) |
| Sonnet 4.6 alone | 60% | 60% | $140 |
| Opus 4.7 alone | 67% | 67% | $285 |
| SaaS audit tools | — | — | $500–2,000 |
Grok 4.3 beat the premium models at roughly one third of Opus 4.7's cost ($90 vs $285 at 15K audits/mo), and runs about 5–20× cheaper than SaaS audit tools.
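The strict-scorer trap can be reproduced with a toy example. The two holdout items below are invented for illustration and are not data from the bake-off. The judged_real flag stands in for the LLM judge's verdict.

```python
def strict_score(expected_quote: str, finding_detail: str) -> bool:
    """Strict-substring scorer: pass only if the expected quote appears verbatim."""
    return expected_quote in finding_detail


def accuracy(passes: list[bool]) -> float:
    """Fraction of holdout items counted as correct."""
    return sum(passes) / len(passes)


# Invented two-item holdout; "judged_real" is a stand-in for the judge verdict.
holdout = [
    {
        "expected": "Postgres 14",
        "detail": 'memory says "Postgres 14", NOTES says "Postgres 16"',
        "judged_real": True,
    },
    {
        # A valid finding that quotes the other side of the contradiction.
        "expected": "A1 VM live",
        "detail": 'memory says "couldn\'t register A1, blocked"',
        "judged_real": True,
    },
]

strict = accuracy([strict_score(h["expected"], h["detail"]) for h in holdout])
judged = accuracy([h["judged_real"] for h in holdout])
print(f"strict={strict:.2f} judged={judged:.2f}")
```

The second item is a genuine contradiction reported through a different quote, so the strict scorer marks it wrong while the judged score counts it.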
Auto-fix safety heuristics
A patch passes only if all are true:
- Patch is ≤ 5 lines
- File extension ∈ {.md, .txt, .yaml, .json}
- File path does NOT match the high-risk set: **/credentials*, **/secrets*, **/*.tfstate, **/cloudflared/config.yml, or anything under ~/Documents/KB/_secrets/
- Patch touches exactly one file
- No URL, port, IP, or hostname value is changed unless the corresponding manual probe was registered and passed against the new value
- No mention of “remove”, “delete”, rm, or DROP in either old or new content
- Grok 4.3 judge confirms the patch resolves the finding without regression
If any fail → skip auto-fix, surface to Telegram for manual triage.
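The local part of this gate might look like the sketch below. Several details are assumptions: the Patch shape, how patch length is counted, the case-insensitive keyword match, and passing the probe and judge results in as booleans. Hostname detection is left out because the document does not say how hostnames are recognised.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass
class Patch:  # hypothetical shape, not the real autofix data model
    paths: list[str]
    old: str
    new: str


ALLOWED_EXT = {".md", ".txt", ".yaml", ".json"}
HIGH_RISK_GLOBS = ["**/credentials*", "**/secrets*", "**/*.tfstate", "**/cloudflared/config.yml"]
SECRETS_DIR = "~/Documents/KB/_secrets/"
# URL, IPv4, and "port N" values; hostnames are not covered by this sketch.
NETWORK = re.compile(r"https?://[^\s)]+|\b(?:\d{1,3}\.){3}\d{1,3}\b|\bport\s+\d+\b")
DESTRUCTIVE = re.compile(r"remove|delete|\brm\b|\bDROP\b", re.IGNORECASE)


def safety_failures(patch: Patch, probe_passed: bool = False, judge_ok: bool = False) -> list[str]:
    """Return the reasons a patch must go to manual triage (empty list = safe)."""
    failures = []
    if len(patch.paths) != 1:
        failures.append("must touch exactly one file")
    if max(len(patch.old.splitlines()), len(patch.new.splitlines())) > 5:
        failures.append("patch longer than 5 lines")
    for path in patch.paths:
        if PurePosixPath(path).suffix not in ALLOWED_EXT:
            failures.append("extension not allowed")
        probe = "/" + path.lstrip("/")  # lets **/name match top-level files too
        if path.startswith(SECRETS_DIR) or any(fnmatchcase(probe, g) for g in HIGH_RISK_GLOBS):
            failures.append("high-risk path")
    if set(NETWORK.findall(patch.old)) != set(NETWORK.findall(patch.new)) and not probe_passed:
        failures.append("network value changed without a passed probe")
    if DESTRUCTIVE.search(patch.old) or DESTRUCTIVE.search(patch.new):
        failures.append("destructive keyword present")
    if not judge_ok:
        failures.append("judge did not confirm the patch")
    return failures
```

Returning the list of failed rules instead of a single boolean lets the Telegram digest say why a patch was held back.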
Telegram digest format
🔴 RED · 2026-05-24 morning bundle (3 findings)
1. Fix Postgres version in memory · ~2 min
memory/personal_rag_hardware.md:14 says "Postgres 14"
but project NOTES says "Postgres 16" (updated 2026-05-04).
✅ Accept ⏸ Snooze 7d ❌ Reject
2. Update Oracle A1 status · ~3 min
CLAUDE.md says "A1 VM live" but memory/mail_watcher_project.md
says "couldn't register A1, blocked".
✅ Accept ⏸ Snooze 7d ❌ Reject
🟡 YELLOW (5) ▼ tap to expand
- Severity emoji first
- Action verb opens each line (no “found that…”)
- Time-box estimate
- Inline triage buttons
- Bundle 2×/day (08:00 + 18:00 local)
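A sketch of a formatter that follows these rules, assuming a simple DigestItem record. The triage row is rendered as plain text here, whereas the real digest uses Telegram inline buttons. DigestItem and format_bundle are illustrative names, not the API of format_adhd.py.

```python
from dataclasses import dataclass

SEVERITY_EMOJI = {"RED": "🔴", "YELLOW": "🟡", "GREEN": "🟢"}
TRIAGE_ROW = "✅ Accept ⏸ Snooze 7d ❌ Reject"  # text stand-in for inline buttons


@dataclass
class DigestItem:  # hypothetical record for one finding
    action: str   # must open with an action verb, e.g. "Fix ..."
    minutes: int  # time-box estimate
    detail: str   # the quoted evidence lines


def format_bundle(level: str, bundle_label: str, items: list[DigestItem]) -> str:
    """Render one severity section: emoji first, then verb-led, time-boxed items."""
    noun = "finding" if len(items) == 1 else "findings"
    lines = [f"{SEVERITY_EMOJI[level]} {level} · {bundle_label} bundle ({len(items)} {noun})"]
    for number, item in enumerate(items, start=1):
        lines.append(f"{number}. {item.action} · ~{item.minutes} min")
        lines.append(item.detail)
        lines.append(TRIAGE_ROW)
    return "\n".join(lines)
```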
PII redaction (mandatory before any LLM call)
import re

REDACTIONS = [
    (re.compile(r"sk-ant-[a-zA-Z0-9_\-]{20,}"), "[REDACTED:anthropic_key]"),
    (re.compile(r"\bsk-(?!ant-)[A-Za-z0-9]{20,}"), "[REDACTED:openai_key]"),
    (re.compile(r"xai-[A-Za-z0-9]{40,}"), "[REDACTED:xai_key]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36,}"), "[REDACTED:github_pat]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED:aws_access_key]"),
    (re.compile(r"-----BEGIN[A-Z ]+PRIVATE KEY-----[\s\S]+?-----END[A-Z ]+PRIVATE KEY-----"),
     "[REDACTED:private_key]"),
    # ... 12+ patterns total
]
For credentials_vault.md and other high-risk files, emails are also partially redacted (name@example.com → [user]@example.com), keeping the domain for cross-source diff signal.
test_redact.py holds 6 unit tests, all passing, and runs in CI.
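Applying the pattern list could look like the sketch below. It repeats two of the patterns so that it runs on its own. The email regex and the redact function name are assumptions, since the actual email pattern in redact.py is not shown here.

```python
import re

# Two patterns copied from the list above so the sketch is self-contained.
REDACTIONS = [
    (re.compile(r"sk-ant-[a-zA-Z0-9_\-]{20,}"), "[REDACTED:anthropic_key]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED:aws_access_key]"),
]

# Assumed email matcher; group 1 captures the domain that is kept.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")


def redact(text: str, partial_email: bool = False) -> str:
    """Apply every redaction pattern, then optionally mask email local parts."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    if partial_email:
        # Keep the domain so cross-source diffs can still compare it.
        text = EMAIL_RE.sub(r"[user]@\1", text)
    return text
```

A test in the style of test_redact.py would feed a synthetic key through redact and assert that only the placeholder remains.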
launchd catch-up logic
cron on macOS skips runs that fall while the machine is asleep or off; launchd can catch them up, given the right plist shape:
<key>StartCalendarInterval</key>
<dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>1</integer>
</dict>
<key>RunAtLoad</key><true/>
Combined with state-file dedupe in audit.py:
last_run_iso = state.get(tier, {}).get("last_run")
if last_run_iso:
    hours_since = (now - parse(last_run_iso)).total_seconds() / 3600
    if hours_since < TIER_MIN_GAP_HOURS[tier]:
        log(f"SKIP: tier={tier} ran {hours_since:.1f}h ago")
        sys.exit(0)
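Defense-in-depth: StartCalendarInterval coalesces runs missed during sleep into a single run on wake, RunAtLoad fires when the agent is loaded (login or boot, which covers a Mac that was powered off), and the state file prevents duplicate runs in the same window.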
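The skip-if-recent check can be isolated into a pure function for testing, as in this sketch. The gap values are placeholders because the real TIER_MIN_GAP_HOURS is not listed in this document. The sketch also assumes last_run is stored as a timezone-aware ISO 8601 string.

```python
from datetime import datetime, timedelta, timezone

# Placeholder gaps; the real TIER_MIN_GAP_HOURS values are not given here.
TIER_MIN_GAP_HOURS = {"daily": 20, "weekly": 6 * 24, "monthly": 27 * 24}


def should_skip(state: dict, tier: str, now: datetime) -> bool:
    """True when the tier already ran within its minimum gap."""
    last_run_iso = state.get(tier, {}).get("last_run")
    if not last_run_iso:
        return False  # never ran, so run now
    last_run = datetime.fromisoformat(last_run_iso)
    hours_since = (now - last_run).total_seconds() / 3600
    return hours_since < TIER_MIN_GAP_HOURS[tier]


# Usage: a RunAtLoad trigger two hours after the scheduled run is skipped.
now = datetime(2026, 5, 24, 5, 1, tzinfo=timezone.utc)
state = {"daily": {"last_run": (now - timedelta(hours=2)).isoformat()}}
print(should_skip(state, "daily", now))
```

Keeping the decision separate from sys.exit makes the dedupe logic unit-testable without spawning the driver.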
Performance numbers
Measured on MacBook Pro M2 Max:
| Operation | Number | Notes |
|---|---|---|
| Daily run wall time | ~45 s | ~10 changed files, Qwen 4B + ~3 Haiku verifications |
| Weekly run wall time | ~6 min | full _personal workspace (~80 files) |
| Monthly run wall time | ~14 min | + cross-workspace, Qwen 32B |
| Event run wall time | ~8 s | scoped to 1–3 files in commit |
| Judge call (Grok 4.3) | ~2.1 s p95 | per finding |
| Auto-fix apply (incl snapshot) | ~1.4 s | per patch |
| Cost per daily run | ~$0.002 | mostly Grok judge |
| Cost per weekly run | ~$0.07 | larger context |
| Cost per monthly run | ~$0.18 | Layer 3 + bigger context |
| Monthly all-in | ~$5.00 | All tiers + judge + Telegram free |
| Real accuracy on holdout-99 | 99% | LLM-judged (Grok 4.3 judge) |
| False positive rate | ~12% | measured on 50-finding sample |
Reliability features
| Feature | How |
|---|---|
| Catch-up after Mac off/sleep | RunAtLoad: true + state-file skip-if-recent |
| Idempotent tier dedupe | per-tier last-run timestamp in state/last_run.json |
| PII never sent to LLM | 12-pattern regex + email partial-redact + 6-test CI gate |
| Auto-fix reversible | git stash --include-untracked snapshot before apply |
| LLM API failover | Grok 4.3 → Haiku 4.5 verifier (lower precision but still surfaces findings) |
| Telegram delivery durability | retry queue persisted to disk, drains on next run |
| Local Qwen OOM | mlx-lm watchdog + smaller-batch retry |
| Finding deduplication | (deprecated fingerprint suppression) → in-source explanatory comments |
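The failover row can be sketched as a thin wrapper around two judge callables. The callables, the exception types treated as "API down", and the backend tag on the result are all assumptions for illustration and do not reflect the real judge.py interface.

```python
from typing import Callable

Judge = Callable[[dict], dict]  # takes a finding, returns a verdict dict

# Assumed set of errors that mean the primary API is unreachable.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def judge_with_failover(finding: dict, primary: Judge, fallback: Judge) -> dict:
    """Try the primary judge; on a transient failure use the fallback verifier."""
    try:
        verdict = primary(finding)
        backend = "primary"
    except TRANSIENT_ERRORS:
        verdict = fallback(finding)
        backend = "fallback"
    # Tag the result so lower-precision verdicts can be marked downstream.
    return {**verdict, "backend": backend}
```

Tagging the backend makes it possible to flag fallback verdicts in the digest, given the lower precision noted in the table.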
Security model
| Threat | Mitigation |
|---|---|
| Secrets leak to Anthropic/xAI API | 12-pattern regex redaction + email partial + 6 unit tests in CI |
| Auto-fix overwrites credentials | high-risk file glob list excluded; safety heuristics reject patches under _secrets/ |
| Auto-fix mis-applies | git snapshot taken before each apply; one git revert <sha> to undo |
| Telegram token theft | bot token in ~/.config/knowledge-audit/secrets.env, chmod 600 |
| Local LLM exfil | Qwen runs offline; no outbound; only cloud calls are explicit (Haiku verify, Grok judge) — both go through redact.py first |
Reproducibility — quickstart for a forker
# 1. Personal-RAG must already be running (or any bge-m3 + pgvector index)
# 2. Clone audit
cd ~/.claude/hooks/
git clone <your-fork>/audit-knowledge && cd audit-knowledge
python3.11 -m venv venv
./venv/bin/pip install -r requirements.txt # mlx-lm, anthropic, xai-grok-sdk,
# python-telegram-bot, pytest
# 3. Download Qwen quantized weights (one-time)
huggingface-cli download mlx-community/Qwen2.5-4B-Instruct-4bit
huggingface-cli download mlx-community/Qwen2.5-8B-Instruct-4bit
huggingface-cli download mlx-community/Qwen2.5-32B-Instruct-4bit
# 4. Configure secrets
mkdir -p ~/.config/knowledge-audit
cat > ~/.config/knowledge-audit/secrets.env <<EOF
ANTHROPIC_API_KEY=sk-ant-...
XAI_API_KEY=xai-...
TELEGRAM_BOT_TOKEN=...
TELEGRAM_CHAT_ID=...
EOF
chmod 600 ~/.config/knowledge-audit/secrets.env
# 5. Install launchd plists
cp launchd/*.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.knowledge-audit.*
# 6. First run (force daily tier ignoring state)
./venv/bin/python audit.py --tier=daily --force
# 7. Configure post-commit hook on KB-s3 mount
ln -s ~/.claude/hooks/audit-knowledge/event_hook.sh \
~/Documents/KB-s3/.git/hooks/post-commit
# 8. Verify Telegram digest arrives
Future work
- Reverse audit — scan codebase for facts that should be in memory but aren’t (proactive bootstrap)
- Adversarial test corpus — synthetic stale facts with known answers, measure precision/recall as a benchmark
- Drift trend chart — sparkline of weekly RED count in digest header
- Per-workspace judge tuning: different prompts for _personal vs LL (different styles of contradictions)
License & attribution
Personal project. Built on:
- Personal-RAG (own infrastructure)
- Eval-Framework (own infrastructure)
- mlx-lm by Apple
- Qwen 2.5 by Alibaba
- Anthropic Claude (Haiku verifier)
- xAI Grok 4.3 (production judge)
- python-telegram-bot