A daemon + Claude Skill that gives Personal-RAG a dedicated trust tier for external AI vendor documentation. Crawls Anthropic docs, xAI docs, HuggingFace model cards, and cited arXiv PDFs into a separate _canon workspace — routed first for any spec/price/version question.
At a glance
- ~180 docs / ~3,400 chunks indexed in the `_canon` workspace (28 MB stored)
- Daily crawl, ~6 min wall time; pulls deltas from a small allowlist of authoritative sources
- Routing overhead: +12 ms p50 added to query path
- Hallucination rate on vendor facts: ~22% → <3% after deploy (judged sample)
- Two modes shipped: Mode A = `fukuro-audit` Claude Skill (ideation + production audit), Mode C = crawler daemon
- Crawler is the only writer to `_canon`; no manual ingest allowed (workspace integrity by design)
- Extends Personal-RAG: reuses bge-m3 embedder, Postgres + pgvector store, MinIO BlobStore; adds one workspace + one MCP tool (`kb_search_canon`)
- Routing rule in MCP orchestrator: spec/price/version queries hit `kb_search_canon` first, fall back to `kb_search_personal`
- Cost: ~$0/month; runs locally on M2 Max alongside the rest of the tako daemon
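One way the "pulls deltas" behaviour of the daily crawl could work is by comparing a content hash against the hash recorded on the previous run. The sketch below is a minimal illustration under that assumption, not the crawler's actual code: `content_hash`, `needs_reingest`, and the in-memory `seen_hashes` dict are hypothetical names, and the dict stands in for whatever persistent store the daemon really uses.

```python
import hashlib


def content_hash(body: str) -> str:
    """Stable fingerprint of a fetched page body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def needs_reingest(url: str, body: str, seen_hashes: dict[str, str]) -> bool:
    """Return True when the page is new or changed since the last crawl.

    seen_hashes maps URL -> hash from the previous run (hypothetical store).
    It is updated in place so the next call sees the new state.
    """
    digest = content_hash(body)
    if seen_hashes.get(url) == digest:
        return False  # unchanged: skip re-chunking and re-embedding
    seen_hashes[url] = digest
    return True
```

Skipping unchanged pages before chunking and embedding is what would keep a daily run over ~180 docs short, since only the changed pages reach the embedder.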
Stack
Python 3.11 · launchd (daily crawl timer) · BeautifulSoup + httpx (fetch/parse) · bge-m3 (shared embedder) · Postgres 16 + pgvector (shared HNSW store) · MinIO S3 (BlobStore mount) · Claude Skill SDK (Mode A: fukuro-audit)
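For readers unfamiliar with launchd timers, a daily job definition can be generated from Python with the standard-library `plistlib`. This is only a sketch: the label, interpreter invocation, module name `canon_crawler`, log path, and the 04:00 schedule are all placeholders, since the README does not state the real values.

```python
import plistlib

# All values below are hypothetical placeholders; the real label, program
# arguments, and crawl hour are not given in this README.
job = {
    "Label": "com.example.canon-crawler",
    "ProgramArguments": ["/usr/bin/env", "python3", "-m", "canon_crawler"],
    # launchd runs the job whenever the wall clock matches this interval,
    # so Hour + Minute alone means once per day.
    "StartCalendarInterval": {"Hour": 4, "Minute": 0},
    "StandardErrorPath": "/tmp/canon-crawler.err.log",
}

# Serialize to the XML plist format that launchd reads.
plist_bytes = plistlib.dumps(job)
print(plist_bytes.decode("utf-8"))
```

The output would typically be saved under `~/Library/LaunchAgents/` and loaded with `launchctl`.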
Documentation
| Doc | Read this for |
|---|---|
| PRD | Problem framing, JTBD, scope cuts (Mode B dropped), success metrics |
| Architecture | Two-mode design, crawl pipeline, routing decision, workspace boundary |
| Implementation | Allowlist rules, crawler internals, classifier wiring, perf numbers |
| Notes | Chronological decision log + scope-cut reasoning |
Allowlist (today)
| Source | Scope | Cadence |
|---|---|---|
| `docs.anthropic.com/*` | Full crawl, deltas | Daily |
| `docs.x.ai/*` | Full crawl, deltas | Daily |
| `huggingface.co/<org>/<model>` | Model cards only | Daily |
| arXiv PDFs cited in personal notes | One-shot ingest | On reference |
Explicitly excluded: blog posts, third-party benchmarks, Twitter threads, personal notes. These live in _personal or _shared, never _canon.
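The allowlist above can be read as a URL predicate that the crawler applies before fetching. The following is a minimal sketch of such a check, not the project's actual rule set: `is_allowed`, `FULL_CRAWL_HOSTS`, and `HF_RESERVED` are hypothetical names, and the list of reserved HuggingFace path prefixes is an assumption. arXiv PDFs are left out because they are ingested one-shot on reference rather than crawled.

```python
from urllib.parse import urlparse

# Hosts crawled in full, per the allowlist table.
FULL_CRAWL_HOSTS = {"docs.anthropic.com", "docs.x.ai"}

# First path segments on huggingface.co that are not model orgs
# (assumed list; extend as needed).
HF_RESERVED = {"datasets", "spaces", "docs", "blog"}


def is_allowed(url: str) -> bool:
    """Return True if the crawler may fetch this URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    if host in FULL_CRAWL_HOSTS:
        return True
    if host == "huggingface.co":
        # Model cards only: exactly /<org>/<model>, no deeper paths.
        segments = [s for s in parsed.path.split("/") if s]
        return len(segments) == 2 and segments[0] not in HF_RESERVED
    return False
```

Matching on the parsed hostname rather than a substring avoids accepting look-alike URLs such as `https://docs.anthropic.com.example.org/`.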
Routing rule
| Query pattern | Workspace |
|---|---|
| "What's the Haiku 4.5 input price?" | _canon |
| "How did I configure Haiku in my project?" | _personal |
| "Sonnet vs Opus differences?" | _canon first, _personal supplement |
| "What did I write about Sonnet last week?" | _personal only |
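The table can be expressed as a function that returns workspaces in search order, decided before any retrieval runs. The sketch below uses simple keyword cues purely as a stand-in for the real classifier, which this README does not describe; `route`, `PERSONAL_CUES`, and `CANON_CUES` are hypothetical names, and the cue lists are assumptions chosen to reproduce the four example rows.

```python
import re

# Hypothetical cue lists; the real classifier is not described here.
# First-person wording signals a question about the user's own notes.
PERSONAL_CUES = re.compile(r"\b(i|my|me|mine|we|our)\b", re.IGNORECASE)
# Spec/price/version wording signals a vendor-fact question.
CANON_CUES = re.compile(
    r"\b(price|pricing|cost|specs?|version|limit|differences?|vs)\b",
    re.IGNORECASE,
)


def route(query: str) -> list[str]:
    """Return workspaces in the order they should be searched."""
    if PERSONAL_CUES.search(query):
        return ["_personal"]
    if CANON_CUES.search(query):
        # Vendor facts: canon first, personal as fallback/supplement.
        return ["_canon", "_personal"]
    return ["_personal"]
```

Checking personal cues first means a query like "How did I configure Haiku in my project?" never touches `_canon`, even though it names a vendor model.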
Project status
| Phase | Milestone |
|---|---|
| Design | Mode A/B/C scoped; Mode B dropped (1 data point insufficient, kb-audit overlap) |
| Mode A | fukuro-audit Claude Skill shipped (ideation + production branches) |
| Mode C | Crawler daemon + allowlist + launchd timer live |
| Workspace | _canon registered in tako server v0.6.0-s3; kb_search_canon MCP tool wired |
| Routing | MCP orchestrator playbook updated; classifier label _canon distinct from _shared |
Launched: 2026-05-24. Total build: ~1 day from design lock to ship.
Why a new workspace, not a tag
The first instinct was a source_type=vendor_doc tag plus a boost at re-rank time. That was tried and proved fragile: every new ingestion source needed the tag, the classifier missed ~15%, and the boost weights needed tuning.
Reframing as a workspace is stronger because the routing decision happens before retrieval, not after. The same pattern is reusable for pm_canon, design_canon, etc.
Extensibility
The _canon workspace pattern is the deliverable, not the AI-vendor allowlist. Future canon workspaces planned:
- `pm_canon`: PM frameworks (Marty Cagan, Lenny, Reforge canon)
- `design_canon`: design systems (Linear, Notion, Apple HIG)
- `eng_canon`: language/framework official docs
Same daemon shape, different allowlist + different routing rule.
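"Same daemon shape, different allowlist + different routing rule" suggests that each canon workspace could be described by a small config record. The sketch below shows one possible shape; `CanonWorkspace`, its field names, and the `REGISTRY` dict are hypothetical, and the planned workspaces are omitted because their allowlists and routing cues are not yet specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonWorkspace:
    """Everything that differs between canon workspaces (assumed fields)."""
    name: str
    allowlist: tuple[str, ...]      # URL patterns the crawler may fetch
    routing_cues: tuple[str, ...]   # query types routed here first
    mcp_tool: str                   # MCP search tool exposed for it


# The one workspace that exists today, filled in from this README.
AI_VENDOR = CanonWorkspace(
    name="_canon",
    allowlist=(
        "docs.anthropic.com/*",
        "docs.x.ai/*",
        "huggingface.co/<org>/<model>",
    ),
    routing_cues=("spec", "price", "version"),
    mcp_tool="kb_search_canon",
)

# Future workspaces (pm_canon, design_canon, eng_canon) would be added
# here once their allowlists and routing rules are decided.
REGISTRY: dict[str, CanonWorkspace] = {AI_VENDOR.name: AI_VENDOR}
```

With a record like this, adding a workspace would mean adding one registry entry rather than changing the crawler or the orchestrator.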