Implementation

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

A production cross-source audit in continuous personal use since 2025-11-08:

3 detection layers (intra-file static / cross-file LLM / cross-workspace LLM)
4 launchd tiers (daily / weekly / monthly / event-triggered)
Production judge: Grok 4.3 (swapped from Haiku 4.5 on 2026-05-23) — real accuracy 80% → 99% after Eval-Framework bake-off
Local-first: Qwen 4B/8B/32B on the M2 Max GPU (Metal via MLX), Anthropic Haiku 4.5 as verifier, Grok 4.3 as judge
Auto-fix policy: Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply
Cost ~$5/month all tiers combined
Output: Telegram digest, ADHD-friendly, bundled 2×/day, severity 🟢🟡🔴 + action verb + time-box

Stack

Layer	Component	Version / Notes
Compute	MacBook Pro M2 Max	local; shares host with Personal-RAG
OS	macOS	launchd-managed jobs (`ai.knowledge-audit.*`)
Runtime	Python	3.11 + venv
Local LLM	`mlx-lm` running Qwen 4B / 8B / 32B Q4	Metal-accelerated via MLX; tier-specific size
Verifier (borderline)	Anthropic Haiku 4.5	`claude-haiku-4-5`
Production judge	xAI Grok 4.3	`grok-4-3`, verified 99% on Eval-Framework holdout-99
Retrieval	Personal-RAG (`bge-m3` + Postgres 16 + pgvector)	re-used; no duplicate index
Delivery	Telegram Bot API	inline triage buttons
Snapshot	git	`git stash --include-untracked` before apply
Scheduler	launchd	4 plists: daily, weekly, monthly, event-watcher

Directory layout

~/.claude/hooks/audit-knowledge/
├── audit.py                       # main driver
├── tiers.py                       # tier config (daily/weekly/monthly/event)
├── layer1_static.py               # intra-file mechanical checks
├── layer2_cross_file.py           # cross-file LLM scan
├── layer3_cross_workspace.py      # cross-workspace LLM scan
├── judge.py                       # Grok 4.3 judge
├── autofix/
│   ├── propose.py                 # Grok patch generator
│   ├── safety.py                  # heuristics filter
│   └── apply.py                   # git snapshot + apply
├── delivery/
│   ├── telegram.py                # digest formatter + sender
│   └── format_adhd.py             # severity tag + action verb + time-box
├── redact.py                      # PII regex patterns (12 patterns + email partial)
├── prompts/
│   ├── detector_l2.txt
│   ├── detector_l3.txt
│   ├── judge.txt
│   └── propose_fix.txt
├── state/
│   ├── last_run.json              # skip-if-recent dedupe
│   ├── suppressions.json          # legacy fingerprint suppression (deprecated)
│   └── digest_buffer.json         # bundling state for 2×/day delivery
└── tests/
    └── test_redact.py             # 6 unit cases, 100% pass

~/Library/LaunchAgents/
├── ai.knowledge-audit.daily.plist
├── ai.knowledge-audit.weekly.plist
├── ai.knowledge-audit.monthly.plist
└── ai.knowledge-audit.event-watcher.plist

Layer 1 — static checks

Mechanical claim extractor with per-type verifier:

Claim type	Detector	Verifier
`path`	regex `(?P<path>(?:~|/)[\w./-]+.\w+)`
`port`	regex `\bport\s+(\d+)\b`	`1 <= int <= 65535`
`ip`	regex IPv4	`socket.inet_aton`
`url`	regex `https?://[^\s)]+`	format-only in Layer 1; the HTTP probe runs in the separate manual probe layer
`version`	regex `Python\s+3\.\d+` etc.	product-specific (`python3 --version` etc.)

Cost: $0. Catches mechanical drift the LLM doesn’t need to think about.
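A minimal sketch of the extract-then-verify pattern for the `port`, `ip`, and `url` rows. The names `Claim`, `EXTRACTORS`, and `check_line` are hypothetical, and the IPv4 pattern is a stand-in because the table only says "regex IPv4"; the actual `layer1_static.py` may be organized differently.

```python
import re
import socket
from dataclasses import dataclass


@dataclass
class Claim:
    kind: str    # claim type from the table: "port", "ip", "url"
    value: str   # the extracted text
    ok: bool     # result of the per-type verifier


def _valid_port(value: str) -> bool:
    return 1 <= int(value) <= 65535


def _valid_ip(value: str) -> bool:
    # inet_aton raises OSError for out-of-range octets such as 999.
    try:
        socket.inet_aton(value)
        return True
    except OSError:
        return False


def _valid_url(value: str) -> bool:
    # Format-only: matching the regex is the whole check at this layer.
    return True


# (claim type, detector regex with one capture group, verifier)
EXTRACTORS = [
    ("port", re.compile(r"\bport\s+(\d+)\b"), _valid_port),
    ("ip", re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b"), _valid_ip),  # assumed pattern
    ("url", re.compile(r"(https?://[^\s)]+)"), _valid_url),
]


def check_line(line: str) -> list[Claim]:
    """Extract every claim on a line and run the matching verifier."""
    claims = []
    for kind, pattern, verify in EXTRACTORS:
        for match in pattern.finditer(line):
            value = match.group(1)
            claims.append(Claim(kind, value, verify(value)))
    return claims
```

Calling `check_line("Postgres listens on port 5432 at 10.0.0.5")` yields one `port` claim and one `ip` claim, both marked valid; a line such as `port 70000` yields a claim with `ok=False`.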

Layer 2 — cross-file LLM scan (within workspace)

Bundle all files in scope into one prompt, ask the model to find contradictions:

## CATEGORY: memory
### file_a.md
<content>
### file_b.md
<content>

## CATEGORY: workspace_claude_md
### CLAUDE.md
<content>

## CATEGORY: project_notes
### project_X/NOTES.md
<content>

Find facts that contradict each other across files, or that look stale.
Output JSON array with: { file, level, type, detail, evidence_chain, confidence }.

Rules:
- "level" ∈ {RED, YELLOW}
  - RED: directly contradicts another source with verbatim quote
  - YELLOW: looks dated/unverified
- "type" ∈ {cross_source_mismatch, stale_fact, broken_assumption}
- "evidence_chain" MUST include verbatim quotes (primary_quote + conflicting_quote)
- "confidence" 0-100

The detector returns raw findings. The judge layer (Grok 4.3) then verifies each one is real before surfacing.
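The bundling and output rules above can be sketched as two small helpers. This is an illustration under assumptions: `build_bundle` and `validate_findings` are hypothetical names, and `evidence_chain` is assumed to be a JSON object with `primary_quote` and `conflicting_quote` keys, which the prompt implies but does not spell out.

```python
import json

ALLOWED_LEVELS = {"RED", "YELLOW"}
ALLOWED_TYPES = {"cross_source_mismatch", "stale_fact", "broken_assumption"}


def build_bundle(files_by_category: dict[str, dict[str, str]]) -> str:
    """Render {category: {filename: content}} in the prompt's heading layout."""
    parts = []
    for category, files in files_by_category.items():
        parts.append(f"## CATEGORY: {category}")
        for name, content in files.items():
            parts.append(f"### {name}")
            parts.append(content)
        parts.append("")  # blank line between categories
    return "\n".join(parts)


def validate_findings(raw: str) -> list[dict]:
    """Keep only findings that satisfy the rules stated in the prompt."""
    try:
        findings = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(findings, list):
        return []
    valid = []
    for finding in findings:
        if not isinstance(finding, dict):
            continue
        chain = finding.get("evidence_chain")
        confidence = finding.get("confidence")
        if (
            finding.get("level") in ALLOWED_LEVELS
            and finding.get("type") in ALLOWED_TYPES
            and isinstance(chain, dict)
            and chain.get("primary_quote")
            and chain.get("conflicting_quote")
            and isinstance(confidence, (int, float))
            and 0 <= confidence <= 100
        ):
            valid.append(finding)
    return valid
```

Dropping malformed findings before they reach the judge keeps judge calls (and cost) proportional to findings that at least carry both quotes.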

Layer 3 — cross-workspace (monthly only)

Same prompt shape, wider context: _personal + _shared + workspace CLAUDE.md bundled. Uses Qwen 32B alone (no Haiku verifier — judge handles it). Longest context window of all tiers.

This is the layer that caught the Oracle A1 hallucination (workspace CLAUDE.md claimed A1 live, personal memory said “couldn’t register”).

The judge prompt (production)

You are verifying whether a flagged finding is a REAL contradiction
or a false positive.

FINDING:
{finding_json}

PRIMARY SOURCE EXCERPT:
{primary_excerpt}

CONFLICTING SOURCE EXCERPT:
{conflicting_excerpt}

Decide:
1. Is the contradiction real (both sources cited are actually present,
   and they actually disagree)?
2. Is the disagreement on the SAME fact (not different things that
   look similar)?
3. Is the more-recent source contradicting the older one
   (genuine drift, not intentional version note)?

Respond with JSON:
{
  "verdict": "real" | "false_positive" | "ambiguous",
  "reason": "<one sentence>",
  "verified_evidence": <verbatim quote that proves the contradiction, or null>
}

This is the prompt that took accuracy from 80% (strict-match scorer on Haiku) to 99% (LLM-judged real accuracy with Grok 4.3) on the Eval-Framework holdout-99 set.
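A sketch of how the judge prompt might be filled and its response parsed. `fill_prompt`, `parse_verdict`, and the fallback to `"ambiguous"` on unparseable output are assumptions for illustration, not confirmed behavior of `judge.py`. One detail worth noting: the template contains literal JSON braces, so `str.format` would raise on it, and the sketch substitutes placeholders by name instead.

```python
import json

VERDICTS = {"real", "false_positive", "ambiguous"}


def fill_prompt(template: str, finding: dict,
                primary_excerpt: str, conflicting_excerpt: str) -> str:
    """Substitute the three placeholders without touching other braces."""
    return (
        template
        .replace("{finding_json}", json.dumps(finding, ensure_ascii=False))
        .replace("{primary_excerpt}", primary_excerpt)
        .replace("{conflicting_excerpt}", conflicting_excerpt)
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON; treat anything malformed as ambiguous (assumed policy)."""
    fallback = {
        "verdict": "ambiguous",
        "reason": "unparseable judge response",
        "verified_evidence": None,
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("verdict") not in VERDICTS:
        return fallback
    return {
        "verdict": data["verdict"],
        "reason": str(data.get("reason", "")),
        "verified_evidence": data.get("verified_evidence"),
    }
```

Falling back to `"ambiguous"` rather than `"real"` means a garbled judge response never promotes a finding on its own.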

Production swap: Haiku 4.5 → Grok 4.3 (2026-05-23)

The Eval-Framework bake-off measured 5 backends on the same 99-finding holdout. Strict-substring scoring was misleading — it counted alternate valid findings as failures (the “Eval Scorer Strict Trap”). LLM-judged real-accuracy was the right metric:

Backend	Strict	LLM-judged	Cost @ 15K audits/mo
Haiku 4.5 alone (as judge)	13%	13%	$60
Grok 4.3 + Sonnet 4.6 judge (shipped)	80%	99%	$90 (SMB scale) / ~$0.61/mo (personal)
Sonnet 4.6 alone	60%	60%	$140
Opus 4.7 alone	67%	67%	$285
SaaS audit tools	—	—	$500–2,000

Grok 4.3 beat the premium models at roughly one third of Opus 4.7's cost ($90 vs $285). At $90 against $500–2,000, it is about 5–20× cheaper than SaaS audit tools.
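The "Eval Scorer Strict Trap" can be illustrated with a toy comparison. Everything here is hypothetical: the two scorers are simplified stand-ins for whatever Eval-Framework runs, and the sample findings are invented for illustration, not drawn from the holdout set.

```python
def strict_score(findings: list[str], expected_substrings: list[str]) -> float:
    """Count a hit only when the expected substring appears verbatim."""
    hits = sum(1 for found, expected in zip(findings, expected_substrings)
               if expected in found)
    return hits / len(expected_substrings)


def judged_score(verdicts: list[str]) -> float:
    """Fraction of findings an LLM judge labelled as real."""
    return sum(1 for verdict in verdicts if verdict == "real") / len(verdicts)


# Hypothetical toy data: the second finding is valid but worded differently
# from the reference answer, so the strict scorer counts it as a miss.
TOY_FINDINGS = [
    "memory says Postgres 14, NOTES says Postgres 16",
    "CLAUDE.md claims A1 VM live; memory says registration blocked",
]
TOY_EXPECTED = ["Postgres 14", "A1 is live"]
TOY_VERDICTS = ["real", "real"]
```

On this toy data the strict scorer credits one of the two findings while the judged scorer credits both, which is the shape of the gap the table reports.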

Auto-fix safety heuristics

A patch passes only if all are true:

Patch is ≤ 5 lines
File extension ∈ {.md, .txt, .yaml, .json}
File path does NOT match high-risk set: **/credentials*, **/secrets*, **/*.tfstate, **/cloudflared/config.yml, anything under ~/Documents/KB/_secrets/
Patch touches exactly one file
No URL, port, IP, or hostname value is changed unless the corresponding manual probe was registered and passed against the new value
No mention of “remove”, “delete”, `rm `, or `DROP ` in either the old or the new content
Grok 4.3 judge confirms the patch resolves the finding without regression

If any fail → skip auto-fix, surface to Telegram for manual triage.
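The checklist above could be expressed as a single function that returns the reasons a patch fails. This is a sketch, not the contents of `autofix/safety.py`: `Patch`, `failed_checks`, and the `probe_passed` / `judge_confirms` flags are hypothetical, "≤ 5 lines" is interpreted as the longer of the old and new text, globs are approximated with `fnmatch`, and hostname changes are not detected here.

```python
import fnmatch
import re
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Patch:
    paths: list[str]  # files the patch touches
    old: str          # text being replaced
    new: str          # replacement text


ALLOWED_EXT = {".md", ".txt", ".yaml", ".json"}
HIGH_RISK_GLOBS = [
    "*/credentials*", "*/secrets*", "*.tfstate",
    "*/cloudflared/config.yml", "*/Documents/KB/_secrets/*",
]
DESTRUCTIVE = re.compile(r"(?i:remove|delete)|\brm |\bDROP ")
NETWORK_VALUE = re.compile(
    r"https?://[^\s)]+|\bport\s+\d+\b|\b\d{1,3}(?:\.\d{1,3}){3}\b"
)


def _is_high_risk(path: str) -> bool:
    rooted = "/" + path.lstrip("/")  # so "*/name*" also matches top-level files
    return any(fnmatch.fnmatchcase(rooted, glob) for glob in HIGH_RISK_GLOBS)


def failed_checks(patch: Patch, probe_passed: bool = False,
                  judge_confirms: bool = False) -> list[str]:
    """Return the list of failed heuristics; an empty list means the patch passes."""
    failures = []
    if len(patch.paths) != 1:
        failures.append("must touch exactly one file")
    if max(len(patch.old.splitlines()), len(patch.new.splitlines())) > 5:
        failures.append("patch longer than 5 lines")
    for path in patch.paths:
        if PurePosixPath(path).suffix not in ALLOWED_EXT:
            failures.append(f"extension not allowed: {path}")
        if _is_high_risk(path):
            failures.append(f"high-risk path: {path}")
    if DESTRUCTIVE.search(patch.old) or DESTRUCTIVE.search(patch.new):
        failures.append("destructive keyword present")
    changed = set(NETWORK_VALUE.findall(patch.old)) != set(NETWORK_VALUE.findall(patch.new))
    if changed and not probe_passed:
        failures.append("network value changed without a passing probe")
    if not judge_confirms:
        failures.append("judge did not confirm the patch")
    return failures
```

Returning reasons rather than a bare boolean gives the Telegram triage message something concrete to show when a patch is skipped.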

Telegram digest format

🔴 RED · 2026-05-24 morning bundle (2 findings)

1. Fix Postgres version in memory  · ~2 min
   memory/personal_rag_hardware.md:14 says "Postgres 14"
   but project NOTES says "Postgres 16" (updated 2026-05-04).
   ✅ Accept  ⏸ Snooze 7d  ❌ Reject

2. Update Oracle A1 status  · ~3 min
   CLAUDE.md says "A1 VM live" but memory/mail_watcher_project.md
   says "couldn't register A1, blocked".
   ✅ Accept  ⏸ Snooze 7d  ❌ Reject

🟡 YELLOW (5)  ▼ tap to expand

Severity emoji first
Action verb opens each line (no “found that…”)
Time-box estimate
Inline triage buttons
Bundle 2×/day (08:00 + 18:00 local)
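A sketch of a formatter that follows these rules. `Item` and `format_bundle` are hypothetical names rather than the contents of `format_adhd.py`, and the triage buttons are left out because they would be attached separately as a Telegram inline keyboard rather than rendered as text.

```python
from dataclasses import dataclass

SEVERITY_EMOJI = {"RED": "🔴", "YELLOW": "🟡", "GREEN": "🟢"}


@dataclass
class Item:
    action: str   # opens with an action verb, e.g. "Fix Postgres version in memory"
    minutes: int  # time-box estimate
    detail: str   # evidence lines, one per line


def format_bundle(level: str, date: str, slot: str, items: list[Item]) -> str:
    """Render one severity bundle: emoji first, then numbered action lines."""
    noun = "finding" if len(items) == 1 else "findings"
    lines = [
        f"{SEVERITY_EMOJI[level]} {level} · {date} {slot} bundle ({len(items)} {noun})",
        "",
    ]
    for number, item in enumerate(items, start=1):
        lines.append(f"{number}. {item.action}  · ~{item.minutes} min")
        lines.extend(f"   {row}" for row in item.detail.splitlines())
        lines.append("")
    return "\n".join(lines).rstrip()
```

Deriving the count in the header from `len(items)` keeps the headline and the body from drifting apart.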

PII redaction (mandatory before any LLM call)

import re

# (pattern, replacement) pairs, applied in order to every payload before any LLM call.
REDACTIONS = [
    (re.compile(r"sk-ant-[a-zA-Z0-9_\-]{20,}"), "[REDACTED:anthropic_key]"),
    (re.compile(r"\bsk-(?!ant-)[A-Za-z0-9]{20,}"), "[REDACTED:openai_key]"),
    (re.compile(r"xai-[A-Za-z0-9]{40,}"), "[REDACTED:xai_key]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36,}"), "[REDACTED:github_pat]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED:aws_access_key]"),
    (re.compile(r"-----BEGIN[A-Z ]+PRIVATE KEY-----[\s\S]+?-----END[A-Z ]+PRIVATE KEY-----"),
     "[REDACTED:private_key]"),
    # ... 12+ patterns total
]

For credentials_vault.md and high-risk files: also partial-redact emails (someone@example.com → [user]@example.com; the domain is kept for cross-source diff signal).

6 unit tests in tests/test_redact.py, 100% pass, run in CI.
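A sketch of how the pattern list and the email partial-redaction might be applied together. The `redact` function, its `partial_emails` flag, and the `EMAIL` pattern are assumptions for illustration; only two of the patterns from the list are repeated to keep the block short.

```python
import re

# Two patterns copied from the list above; the full list has more.
REDACTIONS = [
    (re.compile(r"sk-ant-[a-zA-Z0-9_\-]{20,}"), "[REDACTED:anthropic_key]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED:aws_access_key]"),
]

# Assumed email pattern: capture the domain so it survives redaction.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")


def redact(text: str, partial_emails: bool = False) -> str:
    """Apply every redaction pattern; optionally mask email local parts."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    if partial_emails:
        text = EMAIL.sub(r"[user]@\1", text)
    return text
```

A matching unit test can build the fake key at runtime (for example `"sk-ant-" + "a" * 24`) so that no key-shaped literal sits in the repository.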

launchd catch-up logic

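cron on macOS skips runs missed while the Mac was asleep or off. launchd's StartCalendarInterval fires one coalesced catch-up run on wake from sleep, and RunAtLoad covers the powered-off case by firing when the agent is loaded at login: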

<key>StartCalendarInterval</key>
<dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>1</integer>
</dict>
<key>RunAtLoad</key><true/>

Combined with state-file dedupe in audit.py:

last_run_iso = state.get(tier, {}).get("last_run")
if last_run_iso:
    hours_since = (now - parse(last_run_iso)).total_seconds() / 3600
    if hours_since < TIER_MIN_GAP_HOURS[tier]:
        log(f"SKIP: tier={tier} ran {hours_since:.1f}h ago")
        sys.exit(0)

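Defense-in-depth: RunAtLoad fires when the agent is loaded (login after boot) and StartCalendarInterval catches up on wake, so a run can be triggered twice in one window; the state file prevents the duplicate.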
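The skip-if-recent check can be written as a self-contained pair of helpers. This is a sketch under assumptions: `should_run` and `record_run` are hypothetical names, the gap values in `TIER_MIN_GAP_HOURS` are placeholders (the source does not list them), and timestamps are assumed to be ISO 8601 strings with a timezone.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Placeholder gaps; the real per-tier values are not given in this document.
TIER_MIN_GAP_HOURS = {"daily": 20, "weekly": 144, "monthly": 600, "event": 0}


def _load_state(path: Path) -> dict:
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # missing or corrupt state means "never ran"


def should_run(state_path: str, tier: str, now: datetime) -> bool:
    """True unless this tier already ran within its minimum gap."""
    last_run_iso = _load_state(Path(state_path)).get(tier, {}).get("last_run")
    if not last_run_iso:
        return True
    hours_since = (now - datetime.fromisoformat(last_run_iso)).total_seconds() / 3600
    return hours_since >= TIER_MIN_GAP_HOURS[tier]


def record_run(state_path: str, tier: str, now: datetime) -> None:
    """Persist the run time, writing via a temp file so a crash cannot truncate state."""
    path = Path(state_path)
    state = _load_state(path)
    state.setdefault(tier, {})["last_run"] = now.isoformat()
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)
```

A driver would call `should_run(...)` first, exit early when it returns `False`, and call `record_run(...)` only after the tier completes, so a failed run is retried on the next trigger.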

Performance numbers

Measured on MacBook Pro M2 Max:

Operation	Number	Notes
Daily run wall time	~45 s	~10 changed files, Qwen 4B + ~3 Haiku verifications
Weekly run wall time	~6 min	full _personal workspace (~80 files)
Monthly run wall time	~14 min	+ cross-workspace, Qwen 32B
Event run wall time	~8 s	scoped to 1–3 files in commit
Judge call (Grok 4.3)	~2.1 s p95	per finding
Auto-fix apply (incl snapshot)	~1.4 s	per patch
Cost per daily run	~$0.002	mostly Grok judge
Cost per weekly run	~$0.07	larger context
Cost per monthly run	~$0.18	Layer 3 + bigger context
Monthly all-in	~$5.00	All tiers + judge + Telegram free
Real accuracy on holdout-99	99%	LLM-judged (Grok 4.3 judge)
False positive rate	~12%	measured on 50-finding sample

Reliability features

Feature	How
Catch-up after Mac off/sleep	`RunAtLoad: true` + state-file skip-if-recent
Idempotent tier dedupe	per-tier last-run timestamp in `state/last_run.json`
PII never sent to LLM	12-pattern regex + email partial-redact + 6-test CI gate
Auto-fix reversible	`git stash --include-untracked` snapshot before apply
LLM API failover	Grok 4.3 → Haiku 4.5 verifier (lower precision but still surfaces findings)
Telegram delivery durability	retry queue persisted to disk, drains on next run
Local Qwen OOM	mlx-lm watchdog + smaller-batch retry
Finding deduplication	in-source explanatory comments (replaces the deprecated fingerprint suppression in `state/suppressions.json`)

Security model

Threat	Mitigation
Secrets leak to Anthropic/xAI API	12-pattern regex redaction + email partial + 6 unit tests in CI
Auto-fix overwrites credentials	high-risk file glob list excluded; safety heuristics reject patches under `_secrets/`
Auto-fix mis-applies	git snapshot taken before each apply; one `git revert <sha>` to undo
Telegram token theft	bot token in `~/.config/knowledge-audit/secrets.env`, chmod 600
Local LLM exfil	Qwen runs offline; no outbound; only cloud calls are explicit (Haiku verify, Grok judge) — both go through redact.py first

Reproducibility — quickstart for a forker

# 1. Personal-RAG must already be running (or any bge-m3 + pgvector index)

# 2. Clone audit
cd ~/.claude/hooks/
git clone <your-fork>/audit-knowledge && cd audit-knowledge
python3.11 -m venv venv
./venv/bin/pip install -r requirements.txt  # mlx-lm, anthropic, xai-grok-sdk,
                                            # python-telegram-bot, pytest

# 3. Download Qwen quantized weights (one-time)
huggingface-cli download mlx-community/Qwen2.5-4B-Instruct-4bit
huggingface-cli download mlx-community/Qwen2.5-8B-Instruct-4bit
huggingface-cli download mlx-community/Qwen2.5-32B-Instruct-4bit

# 4. Configure secrets
mkdir -p ~/.config/knowledge-audit
cat > ~/.config/knowledge-audit/secrets.env <<EOF
ANTHROPIC_API_KEY=sk-ant-...
XAI_API_KEY=xai-...
TELEGRAM_BOT_TOKEN=...
TELEGRAM_CHAT_ID=...
EOF
chmod 600 ~/.config/knowledge-audit/secrets.env

# 5. Install launchd plists
cp launchd/*.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.knowledge-audit.*

# 6. First run (force daily tier ignoring state)
./venv/bin/python audit.py --tier=daily --force

# 7. Configure post-commit hook on KB-s3 mount
ln -s ~/.claude/hooks/audit-knowledge/event_hook.sh \
      ~/Documents/KB-s3/.git/hooks/post-commit

# 8. Verify Telegram digest arrives

Future work

Reverse audit — scan codebase for facts that should be in memory but aren’t (proactive bootstrap)
Adversarial test corpus — synthetic stale facts with known answers, measure precision/recall as a benchmark
Drift trend chart — sparkline of weekly RED count in digest header
Per-workspace judge tuning — different prompts for _personal vs LL (different style of contradictions)

License & attribution

Personal project. Built on:

Personal-RAG (own infrastructure)
Eval-Framework (own infrastructure)
mlx-lm by Apple
Qwen 2.5 by Alibaba
Anthropic Claude (Haiku verifier)
xAI Grok 4.3 (production judge)
python-telegram-bot

Knowledge-Audit — Implementation