● Shipped · P1 · Size S · Foundation

AI-Canon-Crawler — Authoritative Vendor Doc RAG

A sister tool to Personal-RAG that crawls authoritative AI vendor docs into a dedicated _canon workspace, with routing rules that prefer vendor truth for spec/price/version questions.

A daemon + Claude Skill that gives Personal-RAG a dedicated trust tier for external AI vendor documentation. Crawls Anthropic docs, xAI docs, HuggingFace model cards, and cited arXiv PDFs into a separate _canon workspace — routed first for any spec/price/version question.

At a glance

  • ~180 docs / ~3,400 chunks indexed in _canon workspace (28 MB stored)
  • Daily crawl, ~6 min wall time — pulls deltas from a small allowlist of authoritative sources
  • Routing overhead: +12 ms p50 added to query path
  • Hallucination rate on vendor facts: ~22% → <3% after deploy (judged sample)
  • Two modes shipped: Mode A = fukuro-audit Claude Skill (ideation + production audit), Mode C = crawler daemon
  • Crawler is the only writer to _canon — no manual ingest allowed (workspace integrity by design)
  • Extends Personal-RAG — reuses bge-m3 embedder, Postgres + pgvector store, MinIO BlobStore; adds one workspace + one MCP tool (kb_search_canon)
  • Routing rule in MCP orchestrator — spec/price/version queries hit kb_search_canon first, fall back to kb_search_personal
  • Cost: ~$0/month — runs locally on M2 Max alongside the rest of the tako daemon
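
The page does not say how the daily crawl decides what counts as a delta. One common approach, shown here purely as an assumption, is to keep a content digest per URL and re-index only pages whose digest changed. The names `content_digest`, `select_deltas`, and the `seen` mapping are hypothetical and not taken from the crawler.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Stable fingerprint of a fetched page body."""
    return hashlib.sha256(body).hexdigest()


def select_deltas(fetched: dict[str, bytes], seen: dict[str, str]) -> dict[str, bytes]:
    """Return only the pages that are new or changed since the last crawl.

    `seen` maps URL -> digest from the previous run and is updated in place.
    (Assumption: the real crawler may instead rely on ETag or Last-Modified.)
    """
    changed: dict[str, bytes] = {}
    for url, body in fetched.items():
        digest = content_digest(body)
        if seen.get(url) != digest:
            changed[url] = body
            seen[url] = digest
    return changed
```

With this shape, an unchanged page costs one hash and no re-embedding, which is consistent with a short daily wall time over a small allowlist.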

Stack

Python 3.11 · launchd (daily crawl timer) · BeautifulSoup + httpx (fetch/parse) · bge-m3 (shared embedder) · Postgres 16 + pgvector (shared HNSW store) · MinIO S3 (BlobStore mount) · Claude Skill SDK (Mode A: fukuro-audit)

Documentation

Doc             Read this for
PRD             Problem framing, JTBD, scope cuts (Mode B dropped), success metrics
Architecture    Two-mode design, crawl pipeline, routing decision, workspace boundary
Implementation  Allowlist rules, crawler internals, classifier wiring, perf numbers
Notes           Chronological decision log + scope-cut reasoning

Allowlist (today)

Source                              Scope               Cadence
docs.anthropic.com/*                Full crawl, deltas  Daily
docs.x.ai/*                         Full crawl, deltas  Daily
huggingface.co/<org>/<model>        Model cards only    Daily
arXiv PDFs cited in personal notes  One-shot ingest     On reference

Explicitly excluded: blog posts, third-party benchmarks, Twitter threads, personal notes. These live in _personal or _shared, never _canon.
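
A minimal sketch of how the crawled entries in the allowlist above could be enforced at fetch time. It assumes HTTPS-only fetching and a small, non-exhaustive set of reserved HuggingFace path prefixes; `FULL_CRAWL_HOSTS`, `HF_RESERVED`, and `is_allowed` are hypothetical names, not the crawler's actual internals. arXiv PDFs are left out because they are ingested on reference rather than crawled.

```python
from urllib.parse import urlsplit

# Hosts crawled in full, per the allowlist table.
FULL_CRAWL_HOSTS = {"docs.anthropic.com", "docs.x.ai"}

# Assumption: top-level HuggingFace sections that are not model cards.
HF_RESERVED = {"blog", "docs", "datasets", "spaces"}


def is_allowed(url: str) -> bool:
    """True if the URL falls inside the crawl allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # assumption: HTTPS only
        return False
    host = (parts.hostname or "").lower()
    if host in FULL_CRAWL_HOSTS:
        return True
    if host == "huggingface.co":
        segments = [s for s in parts.path.split("/") if s]
        # Model cards only: exactly /<org>/<model>, nothing deeper.
        return len(segments) == 2 and segments[0] not in HF_RESERVED
    return False
```

Rejecting anything that is not positively matched keeps blog posts and third-party pages out of _canon by default, which fits the "crawler is the only writer" constraint.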

Routing rule

Query pattern                                Workspace
“What’s the Haiku 4.5 input price?”          _canon
“How did I configure Haiku in my project?”   _personal
“Sonnet vs Opus differences?”                _canon first, _personal supplement
“What did I write about Sonnet last week?”   _personal only
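
The table can be read as a pre-retrieval decision that yields an ordered list of workspaces. The sketch below uses two regular expressions as a stand-in for the real classifier, which this page does not describe; the patterns, `route`, and `search_routed` are assumptions for illustration. It also simplifies "_canon" and "_canon first, _personal supplement" into the same order, with _personal consulted only when _canon returns nothing.

```python
import re
from typing import Callable

Search = Callable[[str], list[str]]

# Assumption: crude keyword stand-ins for the spec/price/version classifier.
VENDOR_FACT = re.compile(
    r"\b(price|pricing|cost|version|spec|differences?|vs)\b", re.I
)
FIRST_PERSON = re.compile(r"\b(I|my|mine|our)\b", re.I)


def route(query: str) -> list[str]:
    """Return workspaces in the order they should be searched."""
    if FIRST_PERSON.search(query):
        return ["_personal"]           # questions about the user's own notes
    if VENDOR_FACT.search(query):
        return ["_canon", "_personal"]  # vendor truth first, then fallback
    return ["_personal"]


def search_routed(query: str, tools: dict[str, Search]) -> tuple[str, list[str]]:
    """Search workspaces in routed order; stop at the first with hits."""
    order = route(query)
    for workspace in order:
        hits = tools[workspace](query)
        if hits:
            return workspace, hits
    return order[-1], []
```

In use, `tools` would map "_canon" to a wrapper around kb_search_canon and "_personal" to one around kb_search_personal.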

Project status

Phase      Milestone
Design     Mode A/B/C scoped; Mode B dropped (1 data point insufficient, kb-audit overlap)
Mode A     fukuro-audit Claude Skill shipped (ideation + production branches)
Mode C     Crawler daemon + allowlist + launchd timer live
Workspace  _canon registered in tako server v0.6.0-s3; kb_search_canon MCP tool wired
Routing    MCP orchestrator playbook updated; classifier label _canon distinct from _shared

Launched: 2026-05-24. Total build: ~1 day from design lock to ship.

Why a new workspace, not a tag

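The first instinct was a source_type=vendor_doc tag plus a boost at re-rank. It was tried and proved fragile: every new ingestion source needed the tag, the classifier missed ~15%, and the boost weights needed tuning.

Reframing as a workspace is stronger because the routing decision happens before retrieval, not after: the query is sent to the right index up front instead of relying on a re-rank boost to surface vendor docs from a mixed result set. The same pattern is reusable for pm_canon, design_canon, etc.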

Extensibility

The _canon workspace pattern is the deliverable, not the AI-vendor allowlist. Future canon workspaces planned:

  • pm_canon — PM frameworks (Marty Cagan, Lenny, Reforge canon)
  • design_canon — design systems (Linear, Notion, Apple HIG)
  • eng_canon — language/framework official docs

Same daemon shape, different allowlist + different routing rule.
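
One way to picture "same daemon shape, different allowlist + different routing rule" is a small per-workspace config record that the daemon iterates over. This is a sketch under assumptions: the `CanonWorkspace` type, its fields, the registry, and the `kb_search_eng_canon` tool name are hypothetical. The allowlist for eng_canon is left empty because the source does not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonWorkspace:
    name: str                        # workspace identifier
    allowlist: tuple[str, ...]       # source patterns the crawler may write from
    route_keywords: tuple[str, ...]  # query topics routed here first
    search_tool: str                 # MCP tool exposed for this workspace


AI_CANON = CanonWorkspace(
    name="_canon",
    allowlist=(
        "docs.anthropic.com/*",
        "docs.x.ai/*",
        "huggingface.co/<org>/<model>",
    ),
    route_keywords=("spec", "price", "version"),
    search_tool="kb_search_canon",
)

# Planned workspace; allowlist and routing keywords are not defined yet.
ENG_CANON = CanonWorkspace(
    name="eng_canon",
    allowlist=(),
    route_keywords=(),
    search_tool="kb_search_eng_canon",  # hypothetical tool name
)

REGISTRY = {ws.name: ws for ws in (AI_CANON, ENG_CANON)}
```

Adding a new canon would then mean adding one record, with the crawler and router code unchanged.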
