AI-Canon-Crawler — PRD

Size S · P1 · Foundation
Status: ✅ Shipped 2026-05-24 (see Implementation for build details)
Actual: ~1 day from design lock → daemon live

1. Problem

Personal-RAG has ~50k sources mixed together: personal notes, project READMEs, meeting transcripts, vendor docs. When asked a factual question about an external AI vendor (pricing, context window, quantization size, API parameter), the retriever surfaces personal notes from months ago before the vendor’s current docs.

Concrete bug: query “What’s the current price of Claude Haiku 4.5?” → top result was a February note speculating about pricing (score 0.81). The vendor’s actual pricing page ranked second (0.74).

Pain: ~22% hallucination rate on vendor-fact questions (judged sample). The retriever isn’t broken: semantic similarity is what it was asked to optimize. The problem is trust tiering: the retriever treated personal notes and vendor docs as the same kind of object, so a stale note could outrank the vendor’s own page.

Why now: hallucinations on vendor facts were starting to leak into blog drafts and project decisions. A fix at the embedding layer wouldn’t help — the issue is upstream of retrieval.

2. Goal & Success Metrics

Goal: When a question is about an external AI vendor’s spec/price/version, the answer comes from the vendor’s current docs — not from a stale personal note.

Metrics — actual achieved:

Metric	Target	Achieved	Note
Hallucination rate on vendor facts	<5%	<3%	Judged on 50-question held-out set
`_canon` workspace size	100+ docs	~180 docs / 3,400 chunks	Anthropic + xAI + HF model cards
Crawl wall time	<15 min	~6 min	Daily delta crawl
Routing overhead p50	<30 ms	+12 ms	Tool-routing decision before retrieval
Storage footprint	<100 MB	28 MB	Small vs personal workspace (~3.2 GB)

3. User journey

User asks Claude (any client): “What’s the current Haiku 4.5 input price?”
MCP orchestrator detects spec/price/version intent → routes to kb_search_canon first.
_canon returns vendor’s pricing page chunks with high confidence.
If _canon returns no results, the orchestrator falls back to kb_search_personal.
Claude synthesizes answer citing the vendor URL.
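
The canon-first routing with fallback described in the steps above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual rule: the keyword pattern `VENDOR_FACT` and the `route` function are hypothetical, and the two search callables stand in for the kb_search_canon and kb_search_personal tools.

```python
import re
from typing import Callable

# Hypothetical keyword heuristic for spec/price/version intent.
# The real playbook rule is not shown in this PRD.
VENDOR_FACT = re.compile(
    r"\b(price|pricing|cost|context window|version|quantization|parameter)\b",
    re.IGNORECASE,
)

Search = Callable[[str], list[str]]


def route(query: str, search_canon: Search, search_personal: Search) -> tuple[str, list[str]]:
    """Return (tool_name, chunks), trying canon first for vendor-fact queries."""
    if VENDOR_FACT.search(query):
        chunks = search_canon(query)
        if chunks:  # canon had an answer, so stop here
            return "kb_search_canon", chunks
    # Either the query is not a vendor-fact question or canon came back empty.
    return "kb_search_personal", search_personal(query)
```

Passing the search tools in as callables keeps the routing decision testable without a live MCP server.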

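In parallel: the crawler daemon runs daily, fetches only allowlisted URLs, processes only pages that changed since the last crawl (deltas), embeds them with the shared bge-m3 model, and upserts the result into the _canon workspace.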

4. Scope (MoSCoW) — final

Must — DONE:

✅ Dedicated _canon workspace in Postgres (extends tako schema)
✅ MCP tool kb_search_canon registered server-side
✅ Crawler daemon — Mode C — daily launchd timer
✅ Allowlist enforced on every crawl tick (no off-list ingests)
✅ Routing rule in MCP orchestrator playbook (spec/price/version → canon first)
✅ fukuro-audit Claude Skill — Mode A — ideation + production audit branches

Should — DONE:

✅ Idempotent re-ingest via SHA-256 hash check (shared with Personal-RAG)
✅ Overwrite-on-recrawl policy (canon is always latest vendor truth)
✅ Classifier label _canon distinct from _shared in tako orchestration
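
The hash check and overwrite-on-recrawl policy above combine into a simple upsert rule, sketched here with an in-memory dict standing in for the Postgres table. The record shape and function names are hypothetical, and whether the real crawler hashes raw HTML or parsed text is not stated in this PRD.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(store: dict, url: str, text: str) -> str:
    """Insert or overwrite the entry for url; skip when the content is unchanged.

    store maps url -> {"sha256": ..., "text": ...} (assumed shape).
    Returns "unchanged", "inserted", or "overwritten".
    """
    digest = content_hash(text)
    existing = store.get(url)
    if existing is not None and existing["sha256"] == digest:
        return "unchanged"  # idempotent: no re-embed, no write
    status = "inserted" if existing is None else "overwritten"
    store[url] = {"sha256": digest, "text": text}  # latest vendor content wins
    return status
```

Keying on the URL and replacing the whole record means there is never more than one version of a vendor page in the workspace.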

Could — DROPPED:

❌ Mode B (continuous evidence gathering during conversation) — dropped per scope-cut. Reasoning: only 1 data point of demand (the original hallucination bug), and overlap with the existing kb-audit weekly job made the marginal value unclear. Re-evaluate after Mode A/C have 4 weeks of usage.
⏸️ pm_canon / design_canon workspaces — pattern proven, deferred to dedicated PRDs
⏸️ Web UI for browsing _canon content — Claude clients are sufficient

Won’t (M1):

Multi-vendor allowlist expansion beyond Anthropic/xAI/HF (one at a time, measured)
Auto-discovery of new vendor doc URLs (allowlist is a feature, not a limit)
Real-time crawl on every conversation (daily is enough; cost/value not justified)

5. Architecture (final)

Two-mode design (after Mode B drop):

Mode A: fukuro-audit Claude Skill — invoked by user, audits ideation or production projects against canon
Mode C: Crawler daemon — runs daily via launchd, sole writer to _canon

See Architecture for diagrams.

6. Tech Stack — final choices

Layer	Choice	Reason
Crawler runtime	Python 3.11	Shared with tako daemon, single venv
Scheduler	launchd	Already managing tako mount-watcher + backups; one less moving part
HTML fetch	httpx (async)	Concurrent allowlist fetch; connection-level retries available via the transport's `retries` option
Parser	BeautifulSoup	Stable, sufficient for vendor doc HTML
Embedder	bge-m3 (shared)	Same model as Personal-RAG; no extra cold start
Vector store	Postgres 16 + pgvector	Shared HNSW index, one DB per workspace tag
Blob storage	MinIO S3 (BlobStore)	Reuses tako mount path; canonical raw HTML archived
Skill SDK	Claude Skill SDK	Mode A = `fukuro-audit` skill

Cost posture: $0/month. Daemon runs on M2 Max alongside tako. No external infra.

7. Milestones — actual

Phase	What shipped
Design	Mode A/B/C scoped on paper; Mode B dropped before any code written
Workspace	`_canon` added to tako server v0.6.0-s3 (schema + classifier + MCP tool)
Mode A	`fukuro-audit` Claude Skill — ideation + production branches
Mode C	Crawler + allowlist + daily launchd timer; first crawl ingested ~180 docs
Routing	MCP orchestrator playbook updated; `_shared` vs `_canon` distinction documented

Ship DoD passed:

✅ Hallucination drop measured on 50-question set (<3%)
✅ Crawl runs daily, deltas only, <15 min wall time (~6 min measured)
✅ _canon is the only workspace the crawler can write to (verified by code path)
✅ Routing rule live and tested with mixed-intent queries

8. Cost & Quota

Item	Cost
Compute (M2 Max, shared with tako)	$0
Postgres + pgvector (local)	$0
MinIO S3 mount (local)	$0
External LLM calls	$0 (crawler is deterministic; no LLM in hot path)
Total	~$0/month

9. Risks & open questions — outcomes

Risks identified at design:

Allowlist drift (vendor changes URL structure) → daemon logs 404s loudly; manual allowlist patch when it happens
_canon content leaking into _personal if classifier misfires → mitigated by workspace-level segregation (DB-level, not tag-level)
Crawl bandwidth hitting vendor rate limits → throttled to 1 req/s per host; well under public limits
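
The 1 req/s per-host throttle could be implemented along these lines. This is a sketch under assumptions: the `HostThrottle` class is hypothetical, and the real daemon may throttle differently (for example with a semaphore or a sleep between sequential fetches).

```python
import asyncio
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Spaces out requests so each host sees at most one per min_interval seconds."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._next_slot: dict[str, float] = {}  # host -> earliest allowed start time

    def reserve(self, url: str, now: float) -> float:
        """Book the next free slot for url's host and return the delay in seconds."""
        host = urlsplit(url).hostname or ""
        slot = max(now, self._next_slot.get(host, now))
        self._next_slot[host] = slot + self.min_interval
        return slot - now

    async def wait(self, url: str) -> None:
        """Sleep until this request's reserved slot; call just before fetching."""
        await asyncio.sleep(self.reserve(url, time.monotonic()))
```

Because `reserve` is synchronous, concurrent coroutines cannot book the same slot, and requests to different hosts never wait on each other.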

Resolved:

Mode B value question → resolved by drop. Two modes is enough.
_shared vs _canon ambiguity → resolved with explicit rule in orchestrator playbook: _shared = Marc’s curated notes (subjective), _canon = vendor truth (objective)

Open (M2):

Q: When to add pm_canon / design_canon? → Wait for measurable pain in those domains; don’t build speculatively.
Q: How to detect a stale _canon entry if vendor silently removes a page? → Proposed: track repeated 404s per URL and auto-deindex; not yet implemented.
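
Since 404-tracking is not yet implemented, the following is only one possible shape for it. The `StaleTracker` class, the threshold of three consecutive misses, and the choice to treat 410 like 404 are all assumptions, not decisions recorded in this PRD.

```python
from collections import defaultdict


class StaleTracker:
    """Counts consecutive 'gone' responses per URL across daily crawls."""

    def __init__(self, threshold: int = 3):  # hypothetical threshold
        self.threshold = threshold
        self.misses: defaultdict[str, int] = defaultdict(int)

    def record(self, url: str, status_code: int) -> bool:
        """Record one crawl result; return True when url should be deindexed."""
        if status_code in (404, 410):  # assumption: 410 Gone counts like 404
            self.misses[url] += 1
            return self.misses[url] >= self.threshold
        if 200 <= status_code < 300:
            self.misses.pop(url, None)  # page is back, so reset the streak
        # Other statuses (e.g. 5xx) neither count as a miss nor reset the streak.
        return False
```

Requiring several consecutive misses avoids deindexing a page over a single transient vendor outage.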

10. Definition of Done

Ship done: ✅ 2026-05-24 — _canon workspace live, crawler daemon running daily, fukuro-audit skill installed, routing rule deployed, hallucination metric measured.

Production-stable done (4-week criterion):

⏳ 4 consecutive weeks of daily crawl with no manual intervention
⏳ Hallucination rate stays <5% over a rolling 50-question sample
⏳ No _canon entries written by any path other than the crawler
