● Shipped P1 Size S Foundation

AI-Canon-Crawler — Implementation

Tech stack deep-dive: allowlist, crawler internals, classifier wiring, perf numbers.

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

Shipped in ~1 day on top of Personal-RAG. Two surfaces:

  • Mode A (fukuro-audit Claude Skill): ideation and production audit branches; reads _canon for vendor truth
  • Mode C (crawler daemon): runs daily via launchd, sole writer to _canon, ~180 docs / 3,400 chunks / 28 MB

Key numbers and scope:

  • +12 ms p50 routing overhead; hallucination rate on vendor facts dropped from ~22% to <3%
  • $0/month: runs on the M2 Max alongside the existing tako daemon
  • Mode B was scoped and dropped before any code was written (one data point was insufficient, and it overlapped with kb-audit)

Stack

| Layer | Component | Notes |
|---|---|---|
| Runtime | Python 3.11 | Shared venv with tako daemon |
| Scheduler | launchd | ai.vuihoc.fukuro-crawl.daily.plist, fires once/day |
| HTTP | httpx (async) | Concurrent fetch, retry, custom User-Agent |
| HTML parse | BeautifulSoup4 | Vendor doc HTML is stable; lxml backend |
| Embedder | bge-m3 (shared) | Same instance as Personal-RAG, no extra cold start |
| Vector store | Postgres 16 + pgvector | HNSW index, workspace column for segregation |
| Blob store | MinIO S3 via BlobStore | Raw HTML archived at s3://canon/<host>/<path>.html |
| Skill SDK | Claude Skill SDK | Mode A = fukuro-audit skill |

Directory layout

~/Documents/Side.Projects/tako/server/
├── src/
│   ├── server_local.py          # MCP server (existing) — added kb_search_canon tool
│   ├── workspaces.py            # added _canon registration
│   └── routing.py               # added canon-first rule for spec/price intents
└── fukuro/
    ├── crawler.py               # Mode C daemon entrypoint
    ├── allowlist.yml            # source of truth for what can enter _canon
    ├── fetchers/
    │   ├── anthropic_docs.py
    │   ├── xai_docs.py
    │   ├── hf_model_card.py
    │   └── arxiv_pdf.py
    ├── parser.py                # BeautifulSoup extraction
    └── guard.py                 # allowlist enforcement (final check before write)

~/.claude/skills/fukuro-audit/
├── SKILL.md                     # Mode A skill definition
└── prompts/
    ├── ideation.md
    └── production.md

~/Library/LaunchAgents/
└── ai.vuihoc.fukuro-crawl.daily.plist

Allowlist

# fukuro/allowlist.yml
sources:
  - host: docs.anthropic.com
    pattern: "/**"
    cadence: daily
    fetcher: anthropic_docs

  - host: docs.x.ai
    pattern: "/**"
    cadence: daily
    fetcher: xai_docs

  - host: huggingface.co
    pattern: "/{org}/{model}"      # model cards ONLY, not blog/spaces/datasets
    cadence: daily
    fetcher: hf_model_card

  - host: arxiv.org
    pattern: "/pdf/*"
    cadence: on_reference          # one-shot, triggered when cited in _personal
    fetcher: arxiv_pdf

Explicitly excluded (rejected by the guard, never crawled): blog posts (/blog/*), third-party benchmarks, Twitter/X threads, personal notes, GitHub repos.
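The guard calls rule.matches(url), but the matching semantics are not spelled out here. The sketch below is one possible reading, not the project's actual code: it assumes "**" spans any path, while "*" and "{name}" each match exactly one path segment. The Rule fields and the deny_first_segments set are hypothetical. The deny set is there because a bare two-segment pattern such as /{org}/{model} would also match /blog/<slug>.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


def pattern_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate an allowlist path pattern into a regex (assumed semantics)."""
    out = []
    # Split on the wildcard tokens while keeping them in the result.
    for token in re.split(r"(\*\*|\*|\{\w+\})", pattern):
        if token == "**":
            out.append(".*")        # any depth, including "/"
        elif token == "*" or re.fullmatch(r"\{\w+\}", token):
            out.append("[^/]+")     # exactly one non-empty segment
        else:
            out.append(re.escape(token))
    return re.compile("".join(out) + "/?")  # tolerate a trailing slash


@dataclass(frozen=True)
class Rule:
    """Hypothetical shape of one allowlist.yml entry."""
    host: str
    pattern: str
    deny_first_segments: frozenset[str] = frozenset()

    def matches(self, url: str) -> bool:
        parsed = urlparse(url)
        if parsed.hostname != self.host:
            return False
        first = parsed.path.strip("/").split("/", 1)[0]
        if first in self.deny_first_segments:
            return False
        return pattern_to_regex(self.pattern).fullmatch(parsed.path) is not None


hf = Rule("huggingface.co", "/{org}/{model}",
          frozenset({"blog", "spaces", "datasets"}))
print(hf.matches("https://huggingface.co/BAAI/bge-m3"))      # model card
print(hf.matches("https://huggingface.co/blog/some-post"))   # blog post
```

Under these assumptions the first call accepts the model card and the second rejects the blog post.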

Guard (the integrity contract)

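# fukuro/guard.py
from urllib.parse import urlparse

# Rule: a parsed allowlist.yml entry exposing .matches(url); definition not shown here.
ALLOWED_HOSTS = {"docs.anthropic.com", "docs.x.ai", "huggingface.co", "arxiv.org"}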

class IntegrityError(Exception): ...

def guard_url(url: str, allowlist: list[Rule]) -> None:
    """Final check before any _canon write. Raises on violation."""
    parsed = urlparse(url)
    if parsed.netloc not in ALLOWED_HOSTS:
        raise IntegrityError(f"host {parsed.netloc} not in allowlist")
    if not any(rule.matches(url) for rule in allowlist):
        raise IntegrityError(f"url {url} matches no allowlist pattern")

def guard_workspace_write(workspace: str, caller: str) -> None:
    """Enforced server-side in tako: only crawler service account writes _canon."""
    if workspace == "_canon" and caller != "fukuro-crawler":
        raise IntegrityError(f"caller {caller} cannot write _canon workspace")

Both checks fire on every ingest attempt. No bypass path. This is the integrity contract.

Workspace registration (tako server change)

# tako/src/workspaces.py — added _canon
WORKSPACE_MAP = {
    "ll":        {"path": "~/Documents/KB/ll/",        "writable_by": "*"},
    "mindx":     {"path": "~/Documents/KB/mindx/",     "writable_by": "*"},
    "_personal": {"path": "~/Documents/KB/_personal/", "writable_by": "*"},
    "_shared":   {"path": "~/Documents/KB/_shared/",   "writable_by": "*"},
    "_canon":    {"path": "~/Documents/KB-s3/_canon/", "writable_by": "fukuro-crawler"},
    "_secrets":  {"path": "~/Documents/KB/_secrets/",  "writable_by": "vault-only"},
}
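
The writable_by field suggests a map-driven write check. A minimal sketch, assuming "*" means any caller and any other value names the single allowed caller; can_write and DEMO_MAP are hypothetical names, and the real enforcement lives in guard_workspace_write().

```python
def can_write(workspace: str, caller: str, workspace_map: dict[str, dict]) -> bool:
    """Return True when `caller` may write to `workspace`."""
    entry = workspace_map.get(workspace)
    if entry is None:
        return False  # unknown workspaces are never writable
    # "*" is assumed to be a wildcard; anything else must equal the caller.
    return entry["writable_by"] in ("*", caller)


# Trimmed copy of the map, for illustration only.
DEMO_MAP = {
    "_personal": {"writable_by": "*"},
    "_canon": {"writable_by": "fukuro-crawler"},
}

print(can_write("_canon", "fukuro-crawler", DEMO_MAP))
print(can_write("_canon", "mcp-client", DEMO_MAP))
```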

And the new MCP tool:

@mcp.tool()
async def kb_search_canon(query: str, top_k: int = 5) -> list[dict]:
    """Search the _canon workspace (external vendor authoritative docs).
    Use for: AI vendor pricing, model specs, context windows, API parameters,
    quantization sizes, deprecation notices. NOT for Marc's personal notes."""
    qvec = embed_model.encode([f"query: {query}"])[0]
    return await db.search(workspace="_canon", qvec=qvec, top_k=top_k)

Routing rule (orchestrator playbook)

The playbook lives in tako’s instructions.py and is sent to every connecting client via serverInfo.instructions. Excerpt of the canon routing rule:

| AI vendor spec/price/version/context/API param | kb_search_canon FIRST,
  kb_search_personal as supplement only if canon empty or score < 0.65 |

_shared vs _canon distinction:
  _shared = Marc's own research notes + cheatsheets (subjective, curated)
  _canon  = external vendor authoritative spec (objective, crawled)

  e.g. "RAG pattern"              → _shared (Marc's interpretation)
       "Anthropic Haiku 4.5 price" → _canon  (vendor truth)

The classifier is intentionally simple: keyword + intent match. No LLM in the hot path.
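
A keyword + intent match of this kind could look roughly like the sketch below. The keyword lists, function names, and the "score" key on search hits are all assumptions for illustration; only the 0.65 threshold comes from the routing rule.

```python
import re

# Hypothetical keyword lists; the real classifier's vocabulary is not shown.
SPEC_KEYWORDS = re.compile(
    r"\b(price|pricing|cost|context window|max tokens|deprecat\w*|"
    r"rate limit|api param\w*|model id|quantiz\w*)\b",
    re.IGNORECASE,
)
VENDOR_NAMES = re.compile(
    r"\b(anthropic|claude|xai|grok|hugging ?face)\b", re.IGNORECASE
)
CANON_MIN_SCORE = 0.65  # threshold from the routing rule


def is_canon_intent(query: str) -> bool:
    """True when the query names a vendor AND asks for a spec-like fact."""
    return bool(SPEC_KEYWORDS.search(query) and VENDOR_NAMES.search(query))


def needs_personal_supplement(canon_hits: list[dict]) -> bool:
    """True when canon is empty or its best hit scores below the threshold."""
    if not canon_hits:
        return True
    return max(hit["score"] for hit in canon_hits) < CANON_MIN_SCORE


print(is_canon_intent("Anthropic Haiku 4.5 price"))
print(is_canon_intent("RAG pattern"))
```

Requiring both a vendor name and a spec keyword keeps general questions out of the canon path, at the cost of missing queries that omit the vendor name.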

Crawler internals

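# fukuro/crawler.py (main loop)
import asyncio
from urllib.parse import urlparse

import httpx

# CrawlReport, Rule, guard_url, last_seen, parser, blob_store and ingest_canon
# come from the rest of fukuro/ and are not shown here.

async def crawl_once(allowlist: list[Rule]) -> CrawlReport:
    report = CrawlReport()
    async with httpx.AsyncClient(
        headers={"User-Agent": "fukuro-crawler/0.1 (+marc personal)"},
        timeout=30.0,
        limits=httpx.Limits(max_connections=4),
    ) as client:
        for rule in allowlist:
            urls = await rule.fetcher.discover(client)  # sitemap or doc index
            for url in urls:
                guard_url(url, allowlist)                # fail fast
                # Send If-Modified-Since only when a prior fetch exists
                # (assumes last_seen returns None for unseen URLs).
                since = last_seen(url)
                headers = {"If-Modified-Since": since} if since else {}
                try:
                    resp = await client.get(url, headers=headers)
                    if resp.status_code == 304:
                        report.skipped_304 += 1
                        continue
                    if resp.status_code == 404:
                        report.errors.append(("404", url))
                        continue
                    resp.raise_for_status()              # other error statuses land in report.errors
                    content = parser.extract_main(resp.text)
                    await blob_store.put(
                        f"s3://canon/{rule.host}{urlparse(url).path}.html",
                        resp.content
                    )
                    await ingest_canon(url=url, content=content, rule=rule)
                    report.ingested += 1
                except httpx.HTTPError as e:
                    report.errors.append((str(e), url))
                finally:
                    # Also runs on the 304/404 `continue` paths, so the throttle always holds.
                    await asyncio.sleep(1.0)             # throttle 1 req/s/host
    return report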

Mode A — fukuro-audit Skill

# ~/.claude/skills/fukuro-audit/SKILL.md
---
name: fukuro-audit
description: |
  Audit AI infrastructure: ideation or production. Triggers:
  "fukuro audit ideation: <idea>" → JTBD/prior-art/scope/ROI/deps/alts rubric
  "fukuro audit production: <project-slug>" → 6 categories code scan + Grok 4.3 judge
  Outputs ADHD-friendly digest: health score + collapsible P0/P1/P2 findings.
---

Audit branches:

  • Ideation: rubric covering JTBD clarity, prior-art collision (via _canon search), scope realism, ROI estimate, dependency risks, alternatives considered.
  • Production: scans the project repo for AI config (model IDs, prompt files, API params), cross-checks against _canon for drift (e.g. deprecated model, price change, new better option), Grok 4.3 judges severity into P0/P1/P2.

Both branches cite _canon URLs so the user can verify every claim.

Performance numbers

Measured on MacBook Pro M2 Max (shared with tako daemon):

| Operation | Number | Note |
|---|---|---|
| Daily crawl wall time | ~6 min | ~180 URLs, 1 req/s throttle, async I/O |
| Crawl HTTP rate | 1 req/s/host | well under vendor rate limits |
| bge-m3 embed (shared) | 0.39 chunks/s on CPU | already sunk cost; reused |
| _canon chunks inserted | ~3,400 | initial seed crawl |
| _canon storage | 28 MB | small vs _personal (~3.2 GB) |
| Routing decision overhead | +12 ms p50 | intent-classifier in orchestrator |
| kb_search_canon p50 | ~820 ms | comparable to kb_search_personal (shared store) |
| Crawler RAM | ~80 MB | when running |
| Crawler CPU | <5% on M2 Max | during 6-min crawl |

Reliability features

| Feature | How |
|---|---|
| Idempotent re-crawl | SHA-256 hash check; skip when content unchanged |
| Delta-aware fetching | If-Modified-Since header; 304 short-circuits embed |
| Allowlist guard | Fires on every URL + every workspace write attempt |
| Raw HTML archive | MinIO mount preserves audit trail of crawled state |
| Auto-restart | launchd KeepAlive on crash |
| 404 logging | Per-URL counter; surfaces vendor URL rot |
| Throttle | 1 req/s per host enforced via asyncio.sleep |
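
The idempotent re-crawl check can be sketched as follows. This assumes the hash is taken over the extracted text after whitespace normalization and that prior digests are available per URL; both are guesses, and the in-memory dict stands in for whatever store the crawler actually uses.

```python
import hashlib


def content_digest(text: str) -> str:
    """SHA-256 over whitespace-normalized text (normalization is an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_ingest(url: str, text: str, seen: dict[str, str]) -> bool:
    """Return False when the content is unchanged since the last crawl."""
    digest = content_digest(text)
    if seen.get(url) == digest:
        return False      # unchanged: skip re-embedding
    seen[url] = digest    # new or changed: remember and ingest
    return True


seen: dict[str, str] = {}
print(should_ingest("https://docs.x.ai/a", "Context  window: 128k", seen))
print(should_ingest("https://docs.x.ai/a", "Context window: 128k\n", seen))
```

This covers pages whose server ignores If-Modified-Since and always answers 200: an unchanged body is still skipped before embedding.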

Security & integrity model

| Concern | Mitigation |
|---|---|
| Allowlist tampering | allowlist.yml checked into git; daemon refuses to start if hash mismatch |
| Workspace write bypass | Server-side guard_workspace_write() rejects non-crawler callers |
| Filesystem ingest leak | Mount-watcher disabled for _canon/ path |
| Manual MCP kb_ingest to _canon | Server returns 403 workspace_protected |
| Vendor doc poisoning | Out of scope (trust the vendor); if vendor doc is wrong, that’s their bug |
| Crawler crash leaking partial data | Transactional upsert: chunks committed only after full source row |
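
The allowlist tampering check could be implemented along these lines. Where the pinned digest comes from is not stated here, so the sketch simply takes it as an argument; verify_allowlist and AllowlistTampered are hypothetical names.

```python
import hashlib
from pathlib import Path


class AllowlistTampered(RuntimeError):
    """Raised when allowlist.yml does not match the pinned digest."""


def file_sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_allowlist(path: Path, pinned_digest: str) -> None:
    """Refuse to continue when the on-disk allowlist has changed."""
    actual = file_sha256(path)
    if actual != pinned_digest:
        raise AllowlistTampered(
            f"{path} digest {actual[:12]} does not match pinned {pinned_digest[:12]}"
        )
```

Calling verify_allowlist() as the first step of the daemon entrypoint would make a mismatch abort the run before any fetch happens.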

Cost

| Item | Cost |
|---|---|
| Compute (M2 Max shared) | $0 |
| Postgres + pgvector (local) | $0 |
| MinIO (local) | $0 |
| LLM in hot path | $0 (deterministic crawler; no LLM call) |
| LLM in audit (Mode A, optional) | ~$0 per invocation (Grok 4.3 judge, sparse) |
| Total | ~$0/month |

Reproducibility — for a forker

# Prereqs: Personal-RAG (tako) already running with Postgres + pgvector + MinIO
cd ~/Documents/Side.Projects/tako/server
git pull                                          # tako v0.6.0-s3 or later

# 1. Register _canon workspace
psql ragkb -c "INSERT INTO workspaces (name, writable_by)
               VALUES ('_canon', 'fukuro-crawler');"

# 2. Drop in fukuro/ folder, edit allowlist.yml to your taste
cp -r fukuro/ ~/Documents/Side.Projects/tako/server/

# 3. Install Claude Skill
cp -r skills/fukuro-audit ~/.claude/skills/

# 4. Install launchd timer
cp launchd/ai.vuihoc.fukuro-crawl.daily.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.vuihoc.fukuro-crawl.daily.plist

# 5. First crawl (manual)
python -m fukuro.crawler --once

Total: about 30 min if Personal-RAG is already running.

Future work

  • Add pm_canon workspace (Marty Cagan, Lenny, Reforge canon) — same daemon shape, different allowlist
  • Add design_canon (Linear, Notion, Apple HIG)
  • 404-driven auto-deindex (when vendor removes a page)
  • Re-evaluate Mode B after 4 weeks of Mode A usage data
  • Selector-drift detector — auto-alert when parsed content shrinks >50% vs prior crawl
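
The selector-drift idea in the last item reduces to a ratio check. A sketch of the arithmetic, assuming content size is measured in characters of extracted text and that the 50% threshold is strict; none of this is implemented yet, and the function names are hypothetical.

```python
def shrink_ratio(prev_len: int, new_len: int) -> float:
    """Fraction by which content shrank; 0.0 when it grew or had no baseline."""
    if prev_len <= 0:
        return 0.0
    return max(0.0, (prev_len - new_len) / prev_len)


def is_selector_drift(prev_len: int, new_len: int, threshold: float = 0.5) -> bool:
    """Flag a page whose parsed content shrank by more than `threshold`."""
    return shrink_ratio(prev_len, new_len) > threshold


print(is_selector_drift(1000, 400))  # shrank 60%
print(is_selector_drift(1000, 500))  # shrank exactly 50%
```

With a strict comparison, a page that shrank by exactly half is not flagged; only the 60% case is.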

License & attribution

Personal project. Built on: