← Back to project
● Shipped P1 Size S Foundation

AI-Canon-Crawler — PRD

Product spec, scope, milestones, and success metrics for AI-Canon-Crawler.

AI-Canon-Crawler — PRD

Size S · P1 · Foundation Status: ✅ Shipped 2026-05-24 — see Implementation for build details Actual: ~1 day from design lock → daemon live

1. Problem

Personal-RAG has ~50k sources mixed together: personal notes, project READMEs, meeting transcripts, vendor docs. When asked a factual question about an external AI vendor (pricing, context window, quantization size, API parameter), the retriever surfaces personal notes from months ago before the vendor’s current docs.

Concrete bug: query “What’s the current price of Claude Haiku 4.5?” → top result was a February note speculating about pricing (score 0.81). The vendor’s actual pricing page ranked second (0.74).

Pain: ~22% hallucination rate on vendor-fact questions (judged sample). The retriever isn’t broken — semantic similarity is what it was asked to optimize. The problem is trust tier: personal notes and vendor docs were the same kind of object to the retriever.

Why now: hallucinations on vendor facts were starting to leak into blog drafts and project decisions. A fix at the embedding layer wouldn’t help — the issue is upstream of retrieval.

2. Goal & Success Metrics

Goal: When a question is about an external AI vendor’s spec/price/version, the answer comes from the vendor’s current docs — not from a stale personal note.

Metrics — actual achieved:

MetricTargetAchievedNote
Hallucination rate on vendor facts<5%<3%Judged on 50-question held-out set
_canon workspace size100+ docs~180 docs / 3,400 chunksAnthropic + xAI + HF model cards
Crawl wall time<15 min~6 minDaily delta crawl
Routing overhead p50<30 ms+12 msTool-routing decision before retrieval
Storage footprint<100 MB28 MBSmall vs personal workspace (~3.2 GB)

3. User journey

  1. User asks Claude (any client): “What’s the current Haiku 4.5 input price?”
  2. MCP orchestrator detects spec/price/version intent → routes to kb_search_canon first.
  3. _canon returns vendor’s pricing page chunks with high confidence.
  4. If _canon empty, fall back to kb_search_personal.
  5. Claude synthesizes answer citing the vendor URL.

Parallel: daemon runs daily, crawls allowlist URLs, deltas only, embeds via shared bge-m3, upserts into _canon workspace.

4. Scope (MoSCoW) — final

Must — DONE:

  • ✅ Dedicated _canon workspace in Postgres (extends tako schema)
  • ✅ MCP tool kb_search_canon registered server-side
  • ✅ Crawler daemon — Mode C — daily launchd timer
  • ✅ Allowlist enforced on every crawl tick (no off-list ingests)
  • ✅ Routing rule in MCP orchestrator playbook (spec/price/version → canon first)
  • fukuro-audit Claude Skill — Mode A — ideation + production audit branches

Should — DONE:

  • ✅ Idempotent re-ingest via SHA-256 hash check (shared with Personal-RAG)
  • ✅ Overwrite-on-recrawl policy (canon is always latest vendor truth)
  • ✅ Classifier label _canon distinct from _shared in tako orchestration

Could — DROPPED:

  • Mode B (continuous evidence gathering during conversation) — dropped per scope-cut. Reasoning: only 1 data point of demand (the original hallucination bug), and overlap with the existing kb-audit weekly job made the marginal value unclear. Re-evaluate after Mode A/C have 4 weeks of usage.
  • ⏸️ pm_canon / design_canon workspaces — pattern proven, deferred to dedicated PRDs
  • ⏸️ Web UI for browsing _canon content — Claude clients are sufficient

Won’t (M1):

  • Multi-vendor allowlist expansion beyond Anthropic/xAI/HF (one at a time, measured)
  • Auto-discovery of new vendor doc URLs (allowlist is a feature, not a limit)
  • Real-time crawl on every conversation (daily is enough; cost/value not justified)

5. Architecture (final)

Two-mode design (after Mode B drop):

  • Mode A: fukuro-audit Claude Skill — invoked by user, audits ideation or production projects against canon
  • Mode C: Crawler daemon — runs daily via launchd, sole writer to _canon

See Architecture for diagrams.

6. Tech Stack — final choices

LayerChoiceReason
Crawler runtimePython 3.11Shared with tako daemon, single venv
SchedulerlaunchdAlready managing tako mount-watcher + backups; one less moving part
HTML fetchhttpx (async)Concurrent allowlist fetch; auto-retry built-in
ParserBeautifulSoupStable, sufficient for vendor doc HTML
Embedderbge-m3 (shared)Same model as Personal-RAG; no extra cold start
Vector storePostgres 16 + pgvectorShared HNSW index, one DB per workspace tag
Blob storageMinIO S3 (BlobStore)Reuses tako mount path; canonical raw HTML archived
Skill SDKClaude Skill SDKMode A = fukuro-audit skill

Cost posture: $0/month. Daemon runs on M2 Max alongside tako. No external infra.

7. Milestones — actual

PhaseWhat shipped
DesignMode A/B/C scoped on paper; Mode B dropped before any code written
Workspace_canon added to tako server v0.6.0-s3 (schema + classifier + MCP tool)
Mode Afukuro-audit Claude Skill — ideation + production branches
Mode CCrawler + allowlist + daily launchd timer; first crawl ingested ~180 docs
RoutingMCP orchestrator playbook updated; _shared vs _canon distinction documented

Ship DoD passed:

  • ✅ Hallucination drop measured on 50-question set (<3%)
  • ✅ Crawl runs nightly, deltas only, <15 min wall time
  • _canon is the only workspace the crawler can write to (verified by code path)
  • ✅ Routing rule live and tested with mixed-intent queries

8. Cost & Quota

ItemCost
Compute (M2 Max, shared with tako)$0
Postgres + pgvector (local)$0
MinIO S3 mount (local)$0
External LLM calls$0 (crawler is deterministic; no LLM in hot path)
Total~$0/month

9. Risks & open questions — outcomes

Risks identified at design:

  • Allowlist drift (vendor changes URL structure) → daemon logs 404s loudly; manual allowlist patch when it happens
  • _canon content leaking into _personal if classifier misfires → mitigated by workspace-level segregation (DB-level, not tag-level)
  • Crawl bandwidth hitting vendor rate limits → throttled to 1 req/s per host; well under public limits

Resolved:

  • Mode B value question → resolved by drop. Two modes is enough.
  • _shared vs _canon ambiguity → resolved with explicit rule in orchestrator playbook: _shared = Marc’s curated notes (subjective), _canon = vendor truth (objective)

Open (M2):

  • Q: When to add pm_canon / design_canon? → Wait for measurable pain in those domains; don’t build speculatively.
  • Q: How to detect a stale _canon entry if vendor silently removes a page? → 404-tracking + auto-deindex; not yet implemented.

10. Definition of Done

Ship done: ✅ 2026-05-24 — _canon workspace live, crawler daemon running daily, fukuro-audit skill installed, routing rule deployed, hallucination metric measured.

Production-stable done (4-week criterion):

  • ⏳ 4 consecutive weeks of daily crawl with no manual intervention
  • ⏳ Hallucination rate stays <5% over a rolling 50-question sample
  • ⏳ No _canon entries written by any path other than the crawler

See also

  • Architecture — Mode A vs Mode C, crawl pipeline, routing
  • Implementation — allowlist code, classifier wiring, perf
  • Notes — chronological decisions including the Mode B drop