AI-Canon-Crawler: why I gave my RAG a separate workspace for vendor truth

TL;DR — My Personal-RAG had ~50k sources mixed together: my own notes, project READMEs, meeting transcripts, vendor docs. When I asked “what’s the Haiku 4.5 input price?”, the retriever surfaced my notes from 3 months ago before the vendor’s current pricing page. I built AI-Canon-Crawler — a daemon that crawls authoritative AI vendor docs (Anthropic, xAI, HuggingFace model cards, arXiv papers I cite) into a dedicated _canon workspace. The routing layer now prefers _canon for spec/price/version questions. Hallucination rate on vendor facts dropped from ~22% to <3%.

JTBD

When I ask my Personal-RAG a factual question about an external AI vendor (pricing, context window, quantization size, API parameter), I want the answer to come from the vendor’s current docs, not from a note I wrote 3 months ago that might be stale.

This is a different JTBD from “find what I wrote about X.” Same retrieval system, different trust tier.

The bug that motivated this

I asked the RAG: “What’s the current price of Claude Haiku 4.5?”

Top result: a note I’d written in February speculating about pricing. Score 0.81. Vendor’s current pricing page: also indexed, score 0.74.

The retriever did its job — semantic similarity is what I asked it to optimize. But the trust tier was wrong. My speculation outranked the source of truth.

Patching the embedding wouldn’t fix this. The fundamental issue: my notes and vendor docs were the same kind of object to the retriever.

Three product decisions

1. Separate workspace, not a tag

First instinct: add source_type=vendor_doc as a tag, boost in re-rank. Tried it — fragile. Every new ingestion source needed the tag, classifier missed 15%, boost weights needed tuning.

Reframed as a workspace: _canon is a separate logical store with its own ingestion pipeline, its own MCP search tool (kb_search_canon), and its own retention policy (always overwrite-on-recrawl).

Workspaces are stronger than tags because the routing decision happens before retrieval, not after.

2. Crawl, don’t accept submissions

The crawler is the only writer to _canon. I can’t manually add a file. This sounds restrictive — it’s the point.

If I could manually add files, my own notes would leak into the canon workspace within a month. Every accidental note becomes a future hallucination source.

The constraint: _canon content must originate from a small allowlist of authoritative URLs (vendor docs, arXiv paper PDFs, official HuggingFace model cards). Daemon checks the allowlist on every crawl tick.

3. Routing rule lives in the orchestrator, not the model

When the question contains spec/price/version keywords → call kb_search_canon first, fall back to kb_search_personal only if canon returns no hits.

This rule lives in the MCP server’s tool-routing playbook, not in a prompt. Prompts drift. Routing rules don’t.

Query pattern	Workspace
”What’s the Haiku 4.5 input price?”	`_canon`
”How did I configure Haiku in my project?”	`_personal`
”What’s the difference between Sonnet and Opus?”	`_canon` first, `_personal` as supplement
”What did I write about Sonnet last week?”	`_personal` only

The routing is the product. The crawler is the input pipeline.

What “authoritative” means in practice

The allowlist today:

docs.anthropic.com/*
docs.x.ai/*
huggingface.co/<org>/<model> (model cards only)
arXiv PDFs I cite in my own notes (one-shot ingest, not recrawled)

Notably excluded: blog posts, third-party benchmarks, Twitter threads, my own notes. These can live in _personal or _shared, never _canon.

Numbers after week 2

Metric	Before	After
Hallucination rate on vendor facts (judged sample)	~22%	<3%
`_canon` workspace size	—	~180 docs, ~3,400 chunks
Crawl cadence	—	Daily, ~6 min wall time
Storage	—	28 MB
Routing overhead	—	+12ms p50

The hallucination drop comes from two things, not one:

Vendor docs are now indexed at all (some weren’t before).
Routing prefers them when the question type matches.

Either one alone would have helped less.

What I’d tell a PM building RAG

Trust tier is a product concept, not an engineering one. Different sources need different retrieval priority, not just different embeddings.
Workspaces > tags when the trust decision should happen before retrieval.
The crawler’s allowlist is the canon’s integrity. One leaky source contaminates the workspace for months.
Route by query intent, not just content similarity. “What is X” and “what did I say about X” are different products.
Measure hallucination on a held-out set of vendor questions. Don’t trust gut feel — the retriever may look fine on dev queries and fail on the questions that matter.

Cost: ~$0/month (crawler runs locally). Replaces a vendor-doc-tab-hunting habit and several embarrassing wrong answers I was about to publish. Worth it.