AI-Canon-Crawler — Authoritative Vendor Doc RAG

A daemon + Claude Skill that gives Personal-RAG a dedicated trust tier for external AI vendor documentation. It crawls Anthropic docs, xAI docs, HuggingFace model cards, and cited arXiv PDFs into a separate _canon workspace, which is queried first for any spec, price, or version question.

At a glance

~180 docs / ~3,400 chunks indexed in _canon workspace (28 MB stored)
Daily crawl, ~6 min wall time — pulls deltas from a small allowlist of authoritative sources
Routing overhead: +12 ms p50 added to query path
Hallucination rate on vendor facts: ~22% → <3% after deploy (judged sample)
Two modes shipped: Mode A = fukuro-audit Claude Skill (ideation + production audit), Mode C = crawler daemon
Crawler is the only writer to _canon — no manual ingest allowed (workspace integrity by design)
Extends Personal-RAG — reuses bge-m3 embedder, Postgres + pgvector store, MinIO BlobStore; adds one workspace + one MCP tool (kb_search_canon)
Routing rule in MCP orchestrator — spec/price/version queries hit kb_search_canon first, fall back to kb_search_personal
Cost: ~$0/month — runs locally on M2 Max alongside the rest of the tako daemon
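
The daily crawl pulls only deltas. This README does not say how a delta is detected, so the following is a minimal sketch that assumes a content-hash comparison; `content_digest`, `select_changed`, and the in-memory digest map are hypothetical names, and the real crawler may rely on something else (HTTP validators, sitemaps, etc.).

```python
import hashlib


def content_digest(text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(fetched: dict[str, str], known_digests: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or differs from the stored digest.

    fetched:       url -> freshly extracted text (assumed shape)
    known_digests: url -> digest recorded on the previous crawl (assumed shape)
    """
    changed = []
    for url, text in fetched.items():
        if known_digests.get(url) != content_digest(text):
            changed.append(url)
    return changed
```

Only the URLs returned by `select_changed` would need re-chunking and re-embedding, which is one plausible way to keep a daily run short.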

Stack

Python 3.11 · launchd (daily crawl timer) · BeautifulSoup + httpx (fetch/parse) · bge-m3 (shared embedder) · Postgres 16 + pgvector (shared HNSW store) · MinIO S3 (BlobStore mount) · Claude Skill SDK (Mode A: fukuro-audit)

Documentation

Doc	Read this for
PRD	Problem framing, JTBD, scope cuts (Mode B dropped), success metrics
Architecture	Two-mode design, crawl pipeline, routing decision, workspace boundary
Implementation	Allowlist rules, crawler internals, classifier wiring, perf numbers
Notes	Chronological decision log + scope-cut reasoning

Allowlist (today)

Source	Scope	Cadence
`docs.anthropic.com/*`	Full crawl, deltas	Daily
`docs.x.ai/*`	Full crawl, deltas	Daily
`huggingface.co/<org>/<model>`	Model cards only	Daily
arXiv PDFs cited in personal notes	One-shot ingest	On reference

Explicitly excluded: blog posts, third-party benchmarks, Twitter threads, personal notes. These live in _personal or _shared, never _canon.
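
As an illustration of how the crawlable rows of the allowlist could be enforced, here is a small sketch. The host set, the reserved HuggingFace prefixes, and the function name `is_allowed` are assumptions made for this example, not the crawler's actual rules; arXiv PDFs are left out because they are ingested one-shot on reference rather than crawled.

```python
from urllib.parse import urlparse

# Hypothetical encoding of the allowlist table above.
FULL_CRAWL_HOSTS = {"docs.anthropic.com", "docs.x.ai"}
# Top-level huggingface.co paths that are not model orgs (assumed list, not exhaustive).
HF_RESERVED_PREFIXES = {"datasets", "spaces", "docs", "blog"}


def is_allowed(url: str) -> bool:
    """Return True if the URL falls inside one of the crawlable allowlist scopes."""
    parts = urlparse(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    if host in FULL_CRAWL_HOSTS:
        return True
    if host == "huggingface.co":
        # Model cards only: exactly two path segments, <org>/<model>.
        segments = [s for s in parts.path.split("/") if s]
        return len(segments) == 2 and segments[0] not in HF_RESERVED_PREFIXES
    return False
```

Anything that fails the check, such as a blog post or a Twitter thread, never reaches _canon.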

Routing rule

Query pattern	Workspace
“What’s the Haiku 4.5 input price?”	`_canon`
“How did I configure Haiku in my project?”	`_personal`
“Sonnet vs Opus differences?”	`_canon` first, `_personal` supplement
“What did I write about Sonnet last week?”	`_personal` only
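
The routing table can be approximated with a keyword heuristic. This is a sketch only: the real classifier is not described here, the cue lists are invented for illustration, and the `_personal` default for unmatched queries is an assumption. Returning `_canon` followed by `_personal` models both the fallback and the "supplement" case.

```python
import re

# Hypothetical cue lists; stand-ins for the real classifier.
PERSONAL_CUES = re.compile(r"\b(i|my|me|mine)\b", re.IGNORECASE)
VENDOR_FACT_CUES = re.compile(
    r"\b(price|pricing|cost|specs?|version|differences?|vs)\b",
    re.IGNORECASE,
)


def route(query: str) -> list[str]:
    """Return the workspaces to search, in priority order."""
    if PERSONAL_CUES.search(query):
        # First-person questions are about the user's own notes.
        return ["_personal"]
    if VENDOR_FACT_CUES.search(query):
        # Vendor facts: canon first, personal as fallback or supplement.
        return ["_canon", "_personal"]
    return ["_personal"]  # assumed default for everything else
```

The point of the design is that this decision is made before any retrieval happens, so a vendor-fact query never competes with personal notes in a shared ranking.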

Project status

Phase	Milestone
Design	Mode A/B/C scoped; Mode B dropped (a single data point was insufficient, and it overlapped with kb-audit)
Mode A	`fukuro-audit` Claude Skill shipped (ideation + production branches)
Mode C	Crawler daemon + allowlist + launchd timer live
Workspace	`_canon` registered in tako server v0.6.0-s3; `kb_search_canon` MCP tool wired
Routing	MCP orchestrator playbook updated; classifier label `_canon` distinct from `_shared`

Launched: 2026-05-24. Total build: ~1 day from design lock to ship.

Why a new workspace, not a tag

The first instinct was to tag documents with source_type=vendor_doc and boost them in re-rank. That was tried and proved fragile: every new ingestion source needed the tag, the classifier missed ~15% of cases, and the boost weights needed tuning.

Reframing as a workspace is stronger because the routing decision happens before retrieval, not after. The same pattern is reusable for pm_canon, design_canon, etc.

Extensibility

The _canon workspace pattern is the deliverable, not the AI-vendor allowlist. Future canon workspaces planned:

pm_canon — PM frameworks (Marty Cagan, Lenny, Reforge canon)
design_canon — design systems (Linear, Notion, Apple HIG)
eng_canon — language/framework official docs

Same daemon shape, different allowlist + different routing rule.
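
One way to picture "same daemon shape, different allowlist + different routing rule" is a small per-workspace config record. The `CanonWorkspace` class, its fields, and `wants` are hypothetical, and the future workspaces are left with empty allowlists because their sources are not yet specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonWorkspace:
    """Hypothetical config record: everything that varies between canon workspaces."""
    name: str
    allowlist: tuple[str, ...]
    routing_cues: tuple[str, ...]


AI_CANON = CanonWorkspace(
    name="_canon",
    allowlist=("docs.anthropic.com/*", "docs.x.ai/*", "huggingface.co/<org>/<model>"),
    routing_cues=("spec", "price", "version"),
)

# Planned workspace; allowlist and cues intentionally left empty (not yet defined).
PM_CANON = CanonWorkspace(name="pm_canon", allowlist=(), routing_cues=())


def wants(workspace: CanonWorkspace, query: str) -> bool:
    """True if any of the workspace's routing cues appears in the query."""
    lowered = query.lower()
    return any(cue in lowered for cue in workspace.routing_cues)
```

Under this framing, adding a new canon workspace means adding one record rather than a new daemon.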