● Shipped P0 Size M Foundation

Knowledge-Audit — Implementation

Code structure, prompts, judge prompt, perf numbers, security model, reproducibility steps.

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

A production cross-source audit in continuous personal use since 2025-11-08:

  • 3 detection layers (intra-file static / cross-file LLM / cross-workspace LLM)
  • 4 launchd tiers (daily / weekly / monthly / event-triggered)
  • Production judge: Grok 4.3 (swapped from Haiku 4.5 on 2026-05-23) — real accuracy 80% → 99% after Eval-Framework bake-off
  • Local-first: Qwen 4B/8B/32B on M2 Max MPS, Anthropic Haiku 4.5 as verifier, Grok 4.3 as judge
  • Auto-fix policy: Grok propose → safety heuristics → Grok 4.3 judge → git snapshot → apply
  • Cost ~$5/month all tiers combined
  • Output: Telegram digest, ADHD-friendly, bundled 2×/day, severity 🟢🟡🔴 + action verb + time-box

Stack

| Layer | Component | Version / Notes |
| --- | --- | --- |
| Compute | MacBook Pro M2 Max | local; shares host with Personal-RAG |
| OS | macOS | launchd-managed jobs (ai.knowledge-audit.*) |
| Runtime | Python | 3.11 + venv |
| Local LLM | mlx-lm running Qwen 4B / 8B / 32B Q4 | MPS-accelerated; tier-specific size |
| Verifier (borderline) | Anthropic Haiku 4.5 | claude-haiku-4-5 |
| Production judge | xAI Grok 4.3 | grok-4-3, verified 99% on Eval-Framework holdout-99 |
| Retrieval | Personal-RAG (bge-m3 + Postgres 16 + pgvector) | re-used; no duplicate index |
| Delivery | Telegram Bot API | inline triage buttons |
| Snapshot | git | git stash --include-untracked before apply |
| Scheduler | launchd | 4 plists: daily, weekly, monthly, event-watcher |

Directory layout

~/.claude/hooks/audit-knowledge/
├── audit.py                       # main driver
├── tiers.py                       # tier config (daily/weekly/monthly/event)
├── layer1_static.py               # intra-file mechanical checks
├── layer2_cross_file.py           # cross-file LLM scan
├── layer3_cross_workspace.py      # cross-workspace LLM scan
├── judge.py                       # Grok 4.3 verifier
├── autofix/
│   ├── propose.py                 # Grok patch generator
│   ├── safety.py                  # heuristics filter
│   └── apply.py                   # git snapshot + apply
├── delivery/
│   ├── telegram.py                # digest formatter + sender
│   └── format_adhd.py             # severity tag + action verb + time-box
├── redact.py                      # PII regex patterns (12 patterns + email partial)
├── prompts/
│   ├── detector_l2.txt
│   ├── detector_l3.txt
│   ├── judge.txt
│   └── propose_fix.txt
├── state/
│   ├── last_run.json              # skip-if-recent dedupe
│   ├── suppressions.json          # legacy fingerprint suppression (deprecated)
│   └── digest_buffer.json         # bundling state for 2×/day delivery
└── tests/
    └── test_redact.py             # 6 unit cases, 100% pass

~/Library/LaunchAgents/
├── ai.knowledge-audit.daily.plist
├── ai.knowledge-audit.weekly.plist
├── ai.knowledge-audit.monthly.plist
└── ai.knowledge-audit.event-watcher.plist

Layer 1 — static checks

Mechanical claim extractor with per-type verifier:

| Claim type | Detector | Verifier |
| --- | --- | --- |
| path | regex `(?P<path>(?:~\|/)[\w./-]+\.\w+)` | |
| port | regex `\bport\s+(\d+)\b` | `1 <= int <= 65535` |
| ip | regex IPv4 | socket.inet_aton |
| url | regex `https?://[^\s)]+` | format-only (Layer 1); HTTP probe is the auto-probe manual layer |
| version | regex `Python\s+3\.\d+` etc. | product-specific (python3 --version etc.) |

Cost: $0. Catches mechanical drift the LLM doesn’t need to think about.
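
As a rough illustration of the extractor-plus-verifier pattern, the sketch below covers only the port and ip rows. The port regex and both verifiers come from the table; the IPv4 regex, the `check_line` helper, and the result tuple shape are assumptions made for the example, not the contents of layer1_static.py.

```python
import re
import socket

# Port detector as listed in the claim table.
PORT_RE = re.compile(r"\bport\s+(\d+)\b")
# Assumed IPv4 shape; the source only says "regex IPv4".
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def verify_port(value: str) -> bool:
    """A port claim is plausible only inside the valid TCP/UDP range."""
    return 1 <= int(value) <= 65535


def verify_ipv4(value: str) -> bool:
    """socket.inet_aton raises OSError for strings that are not valid IPv4."""
    try:
        socket.inet_aton(value)
        return True
    except OSError:
        return False


def check_line(line: str) -> list[tuple[str, str, bool]]:
    """Return (claim_type, raw_value, is_valid) for every claim found in a line."""
    results = [("port", m.group(1), verify_port(m.group(1)))
               for m in PORT_RE.finditer(line)]
    results += [("ip", m.group(0), verify_ipv4(m.group(0)))
                for m in IPV4_RE.finditer(line)]
    return results


if __name__ == "__main__":
    for claim in check_line("Postgres listens on port 5432 at 10.0.0.5"):
        print(claim)
```

Each claim type stays a (regex, verifier) pair, so adding a type does not touch the others and nothing here needs a model call.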

Layer 2 — cross-file LLM scan (within workspace)

Bundle all files in scope into one prompt, ask the model to find contradictions:

## CATEGORY: memory
### file_a.md
<content>
### file_b.md
<content>

## CATEGORY: workspace_claude_md
### CLAUDE.md
<content>

## CATEGORY: project_notes
### project_X/NOTES.md
<content>

Find facts that contradict each other across files, or that look stale.
Output JSON array with: { file, level, type, detail, evidence_chain, confidence }.

Rules:
- "level" ∈ {RED, YELLOW}
  - RED: directly contradicts another source with verbatim quote
  - YELLOW: looks dated/unverified
- "type" ∈ {cross_source_mismatch, stale_fact, broken_assumption}
- "evidence_chain" MUST include verbatim quotes (primary_quote + conflicting_quote)
- "confidence" 0-100

The detector returns raw findings. The judge layer (Grok 4.3) then verifies each one is real before surfacing.
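
Before findings reach the judge, the rules in the prompt can be enforced mechanically. The sketch below is one possible pre-filter, not the project's code: it assumes `evidence_chain` is an object with `primary_quote` and `conflicting_quote` keys (the prompt names those two quotes but does not fix the JSON shape), and `validate_findings` is a hypothetical helper.

```python
import json

ALLOWED_LEVELS = {"RED", "YELLOW"}
ALLOWED_TYPES = {"cross_source_mismatch", "stale_fact", "broken_assumption"}
QUOTE_KEYS = ("primary_quote", "conflicting_quote")  # assumed key names


def validate_findings(raw_json: str, sources: dict[str, str]) -> list[dict]:
    """Keep findings that obey the prompt rules and quote text that really exists.

    `sources` maps file name to file content, i.e. the same files that were
    bundled into the detector prompt.
    """
    corpus = "\n".join(sources.values())
    kept = []
    for finding in json.loads(raw_json):
        chain = finding.get("evidence_chain")
        conf = finding.get("confidence")
        ok = (
            finding.get("level") in ALLOWED_LEVELS
            and finding.get("type") in ALLOWED_TYPES
            and isinstance(conf, (int, float))
            and 0 <= conf <= 100
            and isinstance(chain, dict)
            # A quote that is not verbatim in any source is treated as hallucinated.
            and all(
                isinstance(chain.get(key), str) and chain[key] and chain[key] in corpus
                for key in QUOTE_KEYS
            )
        )
        if ok:
            kept.append(finding)
    return kept
```

Dropping findings whose quotes cannot be located verbatim is cheap and removes one class of false positive before any paid judge call is made.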

Layer 3 — cross-workspace (monthly only)

Same prompt shape, wider context: _personal + _shared + workspace CLAUDE.md bundled. Uses Qwen 32B alone (no Haiku verifier — judge handles it). Longest context window of all tiers.

This is the layer that caught the Oracle A1 hallucination (workspace CLAUDE.md claimed A1 live, personal memory said “couldn’t register”).

The judge prompt (production)

You are verifying whether a flagged finding is a REAL contradiction
or a false positive.

FINDING:
{finding_json}

PRIMARY SOURCE EXCERPT:
{primary_excerpt}

CONFLICTING SOURCE EXCERPT:
{conflicting_excerpt}

Decide:
1. Is the contradiction real (both sources cited are actually present,
   and they actually disagree)?
2. Is the disagreement on the SAME fact (not different things that
   look similar)?
3. Is the more-recent source contradicting the older one
   (genuine drift, not intentional version note)?

Respond with JSON:
{
  "verdict": "real" | "false_positive" | "ambiguous",
  "reason": "<one sentence>",
  "verified_evidence": <verbatim quote that proves the contradiction, or null>
}

This prompt backs the shipped judge: on the Eval-Framework holdout-99 set, the score went from 80% under strict-match scoring to 99% LLM-judged real accuracy with Grok 4.3.
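
A minimal sketch of how the template could be filled and the reply parsed is shown below. The function names are hypothetical, and the fallback to "ambiguous" on a malformed reply is an assumption of this example, not documented behaviour of judge.py.

```python
import json

VERDICTS = {"real", "false_positive", "ambiguous"}


def render_judge_prompt(template: str, finding: dict,
                        primary: str, conflicting: str) -> str:
    """Fill the three placeholders of the judge prompt.

    str.replace is used instead of str.format because the template also
    contains literal JSON braces, which str.format would try to interpret.
    """
    return (
        template
        .replace("{finding_json}", json.dumps(finding, ensure_ascii=False))
        .replace("{primary_excerpt}", primary)
        .replace("{conflicting_excerpt}", conflicting)
    )


def parse_verdict(reply: str) -> dict:
    """Extract the JSON object from the reply; malformed output becomes 'ambiguous'."""
    try:
        start, end = reply.index("{"), reply.rindex("}") + 1
        data = json.loads(reply[start:end])
    except ValueError:  # covers a missing brace and json.JSONDecodeError
        return {"verdict": "ambiguous",
                "reason": "unparseable judge reply",
                "verified_evidence": None}
    if data.get("verdict") not in VERDICTS:
        data["verdict"] = "ambiguous"
    return data
```

Treating anything unparseable as "ambiguous" keeps a flaky model reply from silently promoting or discarding a finding; it surfaces for manual triage instead.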

Production swap: Haiku 4.5 → Grok 4.3 (2026-05-23)

The Eval-Framework bake-off measured 5 backends on the same 99-finding holdout. Strict-substring scoring was misleading — it counted alternate valid findings as failures (the “Eval Scorer Strict Trap”). LLM-judged real-accuracy was the right metric:

| Backend | Strict | LLM-judged | Cost @ 15K audits/mo |
| --- | --- | --- | --- |
| Haiku 4.5 alone (as judge) | 13% | 13% | $60 |
| Grok 4.3 + Sonnet 4.6 judge (shipped) | 80% | 99% | $90 (SMB scale) / ~$0.61/mo (personal) |
| Sonnet 4.6 alone | 60% | 60% | $140 |
| Opus 4.7 alone | 67% | 67% | $285 |
| SaaS audit tools | | | $500–2,000 |

Grok 4.3 beat the premium models at roughly a third of the cost of Opus 4.7 ($90 vs $285 at 15K audits/mo), and the self-built pipeline comes in 5–20× cheaper than SaaS audit tools.
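
The strict-scorer trap is easiest to see on a toy case. The two scorers and the data below are hypothetical illustrations, not the Eval-Framework implementation: a strict substring scorer marks a valid but different finding as a miss, while a judge-based score counts it.

```python
def strict_score(predictions: list[str], expected: list[str]) -> float:
    """Fraction of predictions that contain the expected string verbatim."""
    hits = sum(exp.lower() in pred.lower()
               for pred, exp in zip(predictions, expected))
    return hits / len(expected)


def judged_score(verdicts: list[str]) -> float:
    """Fraction of findings that a judge labelled 'real'."""
    return sum(v == "real" for v in verdicts) / len(verdicts)


# Hypothetical holdout items: the second prediction is a genuine contradiction,
# just not the one the answer key expected.
expected = ["Postgres 14 vs Postgres 16", "A1 VM live vs couldn't register"]
predictions = [
    "memory says Postgres 14 vs Postgres 16 in NOTES",
    "CLAUDE.md says port 8080 but NOTES says port 9090",
]
verdicts = ["real", "real"]  # what an LLM judge might return for the two findings

if __name__ == "__main__":
    print(strict_score(predictions, expected))  # 0.5
    print(judged_score(verdicts))               # 1.0
```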

Auto-fix safety heuristics

A patch passes only if all are true:

  1. Patch is ≤ 5 lines
  2. File extension ∈ {.md, .txt, .yaml, .json}
  3. File path does NOT match high-risk set: **/credentials*, **/secrets*, **/*.tfstate, **/cloudflared/config.yml, anything under ~/Documents/KB/_secrets/
  4. Patch touches exactly one file
  5. No URL, port, IP, or hostname value is changed unless the corresponding manual probe was registered and passed against the new value
  6. No mention of “remove”, “delete”, `rm `, or `DROP ` in either the old or the new content
  7. Grok 4.3 judge confirms the patch resolves the finding without regression

If any fail → skip auto-fix, surface to Telegram for manual triage.
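
The mechanical rules (1 to 4 and 6) can be sketched as a single predicate. Rules 5 and 7 depend on the probe registry and the judge call, so they are left out here. The `Patch` dataclass, the line-count interpretation of rule 1, and the expansion of `~/Documents/KB/_secrets/` into a glob are assumptions of this example rather than the contents of safety.py.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

ALLOWED_EXT = {".md", ".txt", ".yaml", ".json"}
HIGH_RISK_GLOBS = [
    "**/credentials*", "**/secrets*", "**/*.tfstate", "**/cloudflared/config.yml",
    "*/Documents/KB/_secrets/*",  # assumed glob form of ~/Documents/KB/_secrets/
]
# "remove"/"delete" in any case; "rm " and "DROP " as whole tokens.
DESTRUCTIVE_RE = re.compile(r"(?i:remove|delete)|\brm\s|\bDROP\s")


@dataclass
class Patch:  # hypothetical shape of a proposed fix
    path: str            # absolute POSIX path of the file being patched
    old: str             # text being replaced
    new: str             # replacement text
    files_touched: int = 1


def passes_mechanical_checks(patch: Patch) -> bool:
    """Return True only if rules 1-4 and 6 all hold."""
    # Rule 1 (assumed reading): neither side of the patch exceeds 5 lines.
    if max(len(patch.old.splitlines()), len(patch.new.splitlines())) > 5:
        return False
    # Rule 2: extension allow-list.
    if PurePosixPath(patch.path).suffix not in ALLOWED_EXT:
        return False
    # Rule 3: high-risk path globs (fnmatch's * also matches "/").
    if any(fnmatchcase(patch.path, glob) for glob in HIGH_RISK_GLOBS):
        return False
    # Rule 4: exactly one file.
    if patch.files_touched != 1:
        return False
    # Rule 6: no destructive wording on either side.
    if DESTRUCTIVE_RE.search(patch.old) or DESTRUCTIVE_RE.search(patch.new):
        return False
    return True
```

The checks are ordered cheapest first and fail closed: anything not explicitly allowed is routed to manual triage.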

Telegram digest format

🔴 RED · 2026-05-24 morning bundle (3 findings)

1. Fix Postgres version in memory  · ~2 min
   memory/personal_rag_hardware.md:14 says "Postgres 14"
   but project NOTES says "Postgres 16" (updated 2026-05-04).
   ✅ Accept  ⏸ Snooze 7d  ❌ Reject

2. Update Oracle A1 status  · ~3 min
   CLAUDE.md says "A1 VM live" but memory/mail_watcher_project.md
   says "couldn't register A1, blocked".
   ✅ Accept  ⏸ Snooze 7d  ❌ Reject

🟡 YELLOW (5)  ▼ tap to expand

Format rules:

  • Severity emoji first
  • Action verb opens each line (no “found that…”)
  • Time-box estimate
  • Inline triage buttons
  • Bundle 2×/day (08:00 + 18:00 local)
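
A formatter following these rules might look like the sketch below. `DigestItem` and `format_bundle` are hypothetical names, and the triage buttons are rendered as plain text here; in a real Telegram message they would be sent as an inline keyboard attached to the message rather than as part of the body.

```python
from dataclasses import dataclass

EMOJI = {"RED": "🔴", "YELLOW": "🟡", "GREEN": "🟢"}
BUTTONS = "✅ Accept  ⏸ Snooze 7d  ❌ Reject"


@dataclass
class DigestItem:  # hypothetical container for one surfaced finding
    action: str    # opens with an action verb, e.g. "Fix Postgres version in memory"
    minutes: int   # time-box estimate
    detail: str    # one or more lines of evidence


def format_bundle(level: str, date: str, slot: str, items: list[DigestItem]) -> str:
    """Render one severity bundle: emoji first, then numbered action lines."""
    noun = "finding" if len(items) == 1 else "findings"
    lines = [f"{EMOJI[level]} {level} · {date} {slot} bundle ({len(items)} {noun})", ""]
    for number, item in enumerate(items, start=1):
        lines.append(f"{number}. {item.action}  · ~{item.minutes} min")
        lines.extend(f"   {row}" for row in item.detail.splitlines())
        lines.append(f"   {BUTTONS}")
        lines.append("")
    return "\n".join(lines).rstrip()


if __name__ == "__main__":
    print(format_bundle("RED", "2026-05-24", "morning", [
        DigestItem("Fix Postgres version in memory", 2,
                   'memory/personal_rag_hardware.md:14 says "Postgres 14"\n'
                   'but project NOTES says "Postgres 16" (updated 2026-05-04).'),
    ]))
```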

PII redaction (mandatory before any LLM call)

REDACTIONS = [
    (re.compile(r"sk-ant-[a-zA-Z0-9_\-]{20,}"), "[REDACTED:anthropic_key]"),
    (re.compile(r"\bsk-(?!ant-)[A-Za-z0-9]{20,}"), "[REDACTED:openai_key]"),
    (re.compile(r"xai-[A-Za-z0-9]{40,}"), "[REDACTED:xai_key]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36,}"), "[REDACTED:github_pat]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED:aws_access_key]"),
    (re.compile(r"-----BEGIN[A-Z ]+PRIVATE KEY-----[\s\S]+?-----END[A-Z ]+PRIVATE KEY-----"),
     "[REDACTED:private_key]"),
    # ... 12+ patterns total
]

For credentials_vault.md and other high-risk files, emails are also partially redacted (e.g. name@example.com → [user]@example.com; the domain is kept for cross-source diff signal).

6 unit tests in test_redact.py, 100% pass, run in CI.
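
Applying the pattern list is a plain loop over substitutions. The sketch below reuses two of the patterns shown above; `EMAIL_RE`, the `redact` function, and its `high_risk` flag are assumptions made for illustration and may differ from redact.py.

```python
import re

# Two of the documented patterns; the real list has 12+.
REDACTIONS = [
    (re.compile(r"sk-ant-[a-zA-Z0-9_\-]{20,}"), "[REDACTED:anthropic_key]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36,}"), "[REDACTED:github_pat]"),
]

# Assumed email pattern: capture the domain so it survives redaction.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")


def redact(text: str, high_risk: bool = False) -> str:
    """Replace secrets with placeholders; optionally mask email local parts."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    if high_risk:
        # Keep the domain: two files disagreeing on a domain is still a finding.
        text = EMAIL_RE.sub(r"[user]@\1", text)
    return text


if __name__ == "__main__":
    print(redact("key=sk-ant-" + "a" * 24))
    print(redact("contact alice@example.com", high_risk=True))
```

Every cloud-bound prompt would pass through this function first, so the typed placeholders let the model still reason about "a key is present here" without seeing the key.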

launchd catch-up logic

macOS cron doesn’t fire missed runs after sleep/off; launchd does — with the right shape:

<key>StartCalendarInterval</key>
<dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>1</integer>
</dict>
<key>RunAtLoad</key><true/>

Combined with state-file dedupe in audit.py:

last_run_iso = state.get(tier, {}).get("last_run")
if last_run_iso:
    hours_since = (now - parse(last_run_iso)).total_seconds() / 3600
    if hours_since < TIER_MIN_GAP_HOURS[tier]:
        log(f"SKIP: tier={tier} ran {hours_since:.1f}h ago")
        sys.exit(0)

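Defense-in-depth: StartCalendarInterval fires a missed run once on wake from sleep, RunAtLoad fires when the agent is loaded at login or boot, and the state file prevents duplicate runs in the same window.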
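
The skip-if-recent check can be isolated into a pure function, which makes it easy to test without touching the state file. This is a sketch under assumptions: the gap values in `TIER_MIN_GAP_HOURS` are invented for the example, timestamps are assumed to be timezone-aware ISO 8601 strings, and `should_skip` is a hypothetical name.

```python
from datetime import datetime, timezone

# Assumed minimum gaps; the real values live in the project's tier config.
TIER_MIN_GAP_HOURS = {"daily": 20, "weekly": 144, "monthly": 648, "event": 0}


def should_skip(state: dict, tier: str, now: datetime) -> bool:
    """Return True if this tier already ran within its minimum gap."""
    last_run_iso = state.get(tier, {}).get("last_run")
    if not last_run_iso:
        return False  # never ran: always go ahead
    last_run = datetime.fromisoformat(last_run_iso)
    hours_since = (now - last_run).total_seconds() / 3600
    return hours_since < TIER_MIN_GAP_HOURS[tier]


if __name__ == "__main__":
    state = {"daily": {"last_run": "2026-05-24T03:01:00+00:00"}}
    now = datetime(2026, 5, 24, 9, 0, tzinfo=timezone.utc)
    print(should_skip(state, "daily", now))  # True: only about 6 h since the last run
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic under test.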

Performance numbers

Measured on MacBook Pro M2 Max:

| Operation | Number | Notes |
| --- | --- | --- |
| Daily run wall time | ~45 s | ~10 changed files, Qwen 4B + ~3 Haiku verifications |
| Weekly run wall time | ~6 min | full _personal workspace (~80 files) |
| Monthly run wall time | ~14 min | + cross-workspace, Qwen 32B |
| Event run wall time | ~8 s | scoped to 1–3 files in commit |
| Judge call (Grok 4.3) | ~2.1 s p95 | per finding |
| Auto-fix apply (incl snapshot) | ~1.4 s | per patch |
| Cost per daily run | ~$0.002 | mostly Grok judge |
| Cost per weekly run | ~$0.07 | larger context |
| Cost per monthly run | ~$0.18 | Layer 3 + bigger context |
| Monthly all-in | ~$5.00 | All tiers + judge + Telegram free |
| Real accuracy on holdout-99 | 99% | LLM-judged (Grok 4.3 judge) |
| False positive rate | ~12% | measured on 50-finding sample |

Reliability features

| Feature | How |
| --- | --- |
| Catch-up after Mac off/sleep | RunAtLoad: true + state-file skip-if-recent |
| Idempotent tier dedupe | per-tier last-run timestamp in state/last_run.json |
| PII never sent to LLM | 12-pattern regex + email partial-redact + 6-test CI gate |
| Auto-fix reversible | git stash --include-untracked snapshot before apply |
| LLM API failover | Grok 4.3 → Haiku 4.5 verifier (lower precision but still surfaces findings) |
| Telegram delivery durability | retry queue persisted to disk, drains on next run |
| Local Qwen OOM | mlx-lm watchdog + smaller-batch retry |
| Finding deduplication | (deprecated fingerprint suppression) → in-source explanatory comments |

Security model

| Threat | Mitigation |
| --- | --- |
| Secrets leak to Anthropic/xAI API | 12-pattern regex redaction + email partial + 6 unit tests in CI |
| Auto-fix overwrites credentials | high-risk file glob list excluded; safety heuristics reject patches under _secrets/ |
| Auto-fix mis-applies | git snapshot taken before each apply; one git revert <sha> to undo |
| Telegram token theft | bot token in ~/.config/knowledge-audit/secrets.env, chmod 600 |
| Local LLM exfil | Qwen runs offline; no outbound; only cloud calls are explicit (Haiku verify, Grok judge), and both go through redact.py first |

Reproducibility — quickstart for a forker

# 1. Personal-RAG must already be running (or any bge-m3 + pgvector index)

# 2. Clone audit
cd ~/.claude/hooks/
git clone <your-fork>/audit-knowledge && cd audit-knowledge
python3.11 -m venv venv
./venv/bin/pip install -r requirements.txt  # mlx-lm, anthropic, xai-grok-sdk,
                                            # python-telegram-bot, pytest

# 3. Download Qwen quantized weights (one-time)
huggingface-cli download mlx-community/Qwen2.5-4B-Instruct-4bit
huggingface-cli download mlx-community/Qwen2.5-8B-Instruct-4bit
huggingface-cli download mlx-community/Qwen2.5-32B-Instruct-4bit

# 4. Configure secrets
mkdir -p ~/.config/knowledge-audit          # the directory may not exist on a fresh machine
cat > ~/.config/knowledge-audit/secrets.env <<EOF
ANTHROPIC_API_KEY=sk-ant-...
XAI_API_KEY=xai-...
TELEGRAM_BOT_TOKEN=...
TELEGRAM_CHAT_ID=...
EOF
chmod 600 ~/.config/knowledge-audit/secrets.env

# 5. Install launchd plists
cp launchd/*.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.knowledge-audit.*

# 6. First run (force daily tier ignoring state)
./venv/bin/python audit.py --tier=daily --force

# 7. Configure post-commit hook on KB-s3 mount
ln -s ~/.claude/hooks/audit-knowledge/event_hook.sh \
      ~/Documents/KB-s3/.git/hooks/post-commit

# 8. Verify Telegram digest arrives

Future work

  • Reverse audit — scan codebase for facts that should be in memory but aren’t (proactive bootstrap)
  • Adversarial test corpus — synthetic stale facts with known answers, measure precision/recall as a benchmark
  • Drift trend chart — sparkline of weekly RED count in digest header
  • Per-workspace judge tuning — different prompts for _personal vs LL (different style of contradictions)

License & attribution

Personal project. Built on: