Implementation

Sister docs: PRD (intent), Architecture (system view), Notes (decision log).

TL;DR

Shipped in ~1 day on top of Personal-RAG. Two surfaces:

Mode A — fukuro-audit Claude Skill: ideation + production audit branches, reads _canon for vendor truth
Mode C — Crawler daemon: launchd daily, sole writer to _canon, ~180 docs / 3,400 chunks / 28 MB
+12 ms p50 routing overhead; hallucination rate on vendor facts dropped from ~22% to <3%
$0/month — runs on M2 Max alongside the existing tako daemon
Mode B was scoped and dropped before any code was written (a single data point was not enough to justify it, and it overlapped with kb-audit)

Stack

Layer	Component	Notes
Runtime	Python 3.11	Shared venv with tako daemon
Scheduler	launchd	`ai.vuihoc.fukuro-crawl.daily.plist`, fires once/day
HTTP	httpx (async)	Concurrent fetch, retry, custom User-Agent
HTML parse	BeautifulSoup4	Vendor doc HTML is stable; lxml backend
Embedder	bge-m3 (shared)	Same instance as Personal-RAG, no extra cold start
Vector store	Postgres 16 + pgvector	HNSW index, workspace column for segregation
Blob store	MinIO S3 via BlobStore	Raw HTML archived at `s3://canon/<host>/<path>.html`
Skill SDK	Claude Skill SDK	Mode A = `fukuro-audit` skill

Directory layout

~/Documents/Side.Projects/tako/server/
├── src/
│   ├── server_local.py          # MCP server (existing) — added kb_search_canon tool
│   ├── workspaces.py            # added _canon registration
│   └── routing.py               # added canon-first rule for spec/price intents
└── fukuro/
    ├── crawler.py               # Mode C daemon entrypoint
    ├── allowlist.yml            # source of truth for what can enter _canon
    ├── fetchers/
    │   ├── anthropic_docs.py
    │   ├── xai_docs.py
    │   ├── hf_model_card.py
    │   └── arxiv_pdf.py
    ├── parser.py                # BeautifulSoup extraction
    └── guard.py                 # allowlist enforcement (final check before write)

~/.claude/skills/fukuro-audit/
├── SKILL.md                     # Mode A skill definition
└── prompts/
    ├── ideation.md
    └── production.md

~/Library/LaunchAgents/
└── ai.vuihoc.fukuro-crawl.daily.plist

Allowlist

# fukuro/allowlist.yml
sources:
  - host: docs.anthropic.com
    pattern: "/**"
    cadence: daily
    fetcher: anthropic_docs

  - host: docs.x.ai
    pattern: "/**"
    cadence: daily
    fetcher: xai_docs

  - host: huggingface.co
    pattern: "/{org}/{model}"      # model cards ONLY, not blog/spaces/datasets
    cadence: daily
    fetcher: hf_model_card

  - host: arxiv.org
    pattern: "/pdf/*"
    cadence: on_reference          # one-shot, triggered when cited in _personal
    fetcher: arxiv_pdf

Explicitly excluded (rejected by the guard, never crawled): blog posts (/blog/*), third-party benchmarks, Twitter/X threads, personal notes, GitHub repos.

Guard (the integrity contract)

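# fukuro/guard.py
from __future__ import annotations  # Rule (allowlist entry type) is defined elsewhere in fukuro, not shown

from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.anthropic.com", "docs.x.ai", "huggingface.co", "arxiv.org"}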

class IntegrityError(Exception): ...

def guard_url(url: str, allowlist: list[Rule]) -> None:
    """Final check before any _canon write. Raises on violation."""
    parsed = urlparse(url)
    if parsed.netloc not in ALLOWED_HOSTS:
        raise IntegrityError(f"host {parsed.netloc} not in allowlist")
    if not any(rule.matches(url) for rule in allowlist):
        raise IntegrityError(f"url {url} matches no allowlist pattern")

def guard_workspace_write(workspace: str, caller: str) -> None:
    """Enforced server-side in tako: only crawler service account writes _canon."""
    if workspace == "_canon" and caller != "fukuro-crawler":
        raise IntegrityError(f"caller {caller} cannot write _canon workspace")

Both checks fire on every ingest attempt. No bypass path. This is the integrity contract.
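
guard_url() relies on rule.matches(), which is not shown above. A minimal sketch of how such a matcher could work follows. The Rule dataclass, the pattern_to_regex helper, and the pattern semantics (`**` for any remainder, `*` and `{name}` for exactly one path segment) are assumptions for illustration, not the project's actual implementation.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


def pattern_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate an allowlist path pattern into a regex (hypothetical helper).

    Assumed semantics, not confirmed by the source:
      **      -> any remainder of the path, slashes included
      *       -> exactly one non-empty path segment
      {name}  -> exactly one non-empty path segment
    """
    parts = []
    for token in re.split(r"(\*\*|\*|\{[^/{}]+\})", pattern):
        if token == "**":
            parts.append(".*")
        elif token == "*" or (token.startswith("{") and token.endswith("}")):
            parts.append("[^/]+")
        else:
            parts.append(re.escape(token))  # literal text, e.g. "/pdf/"
    return re.compile("".join(parts))


@dataclass(frozen=True)
class Rule:
    """Illustrative stand-in for an allowlist.yml entry (host + path pattern only)."""
    host: str
    pattern: str

    def matches(self, url: str) -> bool:
        parsed = urlparse(url)
        if parsed.netloc != self.host:
            return False
        # fullmatch: the whole path must fit the pattern, not just a prefix.
        return pattern_to_regex(self.pattern).fullmatch(parsed.path) is not None


hf = Rule(host="huggingface.co", pattern="/{org}/{model}")
print(hf.matches("https://huggingface.co/BAAI/bge-m3"))
print(hf.matches("https://huggingface.co/datasets/org/name"))
```

Under these assumed semantics a two-segment path such as /blog/<slug> would still match `/{org}/{model}`, so a pattern match alone would not be enough to enforce the exclusions listed in the Allowlist section; an extra deny check on the first segment would be one way to close that gap.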

Workspace registration (tako server change)

# tako/src/workspaces.py — added _canon
WORKSPACE_MAP = {
    "ll":        {"path": "~/Documents/KB/ll/",        "writable_by": "*"},
    "mindx":     {"path": "~/Documents/KB/mindx/",     "writable_by": "*"},
    "_personal": {"path": "~/Documents/KB/_personal/", "writable_by": "*"},
    "_shared":   {"path": "~/Documents/KB/_shared/",   "writable_by": "*"},
    "_canon":    {"path": "~/Documents/KB-s3/_canon/", "writable_by": "fukuro-crawler"},
    "_secrets":  {"path": "~/Documents/KB/_secrets/",  "writable_by": "vault-only"},
}

And the new MCP tool:

@mcp.tool()
async def kb_search_canon(query: str, top_k: int = 5) -> list[dict]:
    """Search the _canon workspace (external vendor authoritative docs).
    Use for: AI vendor pricing, model specs, context windows, API parameters,
    quantization sizes, deprecation notices. NOT for Marc's personal notes."""
    qvec = embed_model.encode([f"query: {query}"])[0]
    return await db.search(workspace="_canon", qvec=qvec, top_k=top_k)

Routing rule (orchestrator playbook)

The playbook lives in tako’s instructions.py and is sent to every connecting client via serverInfo.instructions. Excerpt of the canon routing rule:

| AI vendor spec/price/version/context/API param | kb_search_canon FIRST,
  kb_search_personal as supplement only if canon empty or score < 0.65 |

_shared vs _canon distinction:
  _shared = Marc's own research notes + cheatsheets (subjective, curated)
  _canon  = external vendor authoritative spec (objective, crawled)

  e.g. "RAG pattern"              → _shared (Marc's interpretation)
       "Anthropic Haiku 4.5 price" → _canon  (vendor truth)

The classifier is intentionally simple: keyword + intent match. No LLM in the hot path.
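
To make the keyword + intent idea concrete, here is a minimal sketch of a deterministic router with the 0.65 fallback from the rule above. The keyword sets, the function names, and the assumption that each search hit is a dict with a "score" key (and that the threshold applies to the best hit) are all illustrative; the real playbook vocabulary is not shown in this doc.

```python
import re

# Hypothetical keyword lists, for illustration only.
VENDOR_TERMS = {"anthropic", "claude", "haiku", "sonnet", "opus", "xai", "grok"}
SPEC_TERMS = {"price", "pricing", "cost", "context", "window", "version",
              "deprecated", "deprecation", "parameter", "quantization"}

CANON_MIN_SCORE = 0.65  # threshold quoted in the routing rule


def classify(query: str) -> str:
    """Return "_canon" for vendor spec/price questions, else "_shared"."""
    words = set(re.findall(r"[a-z0-9.]+", query.lower()))
    # Require both a vendor term and a spec term: plain set intersections,
    # so no model call is needed in the hot path.
    if words & VENDOR_TERMS and words & SPEC_TERMS:
        return "_canon"
    return "_shared"


def needs_personal_supplement(canon_hits: list[dict]) -> bool:
    """True when canon returned nothing or its best hit is below the threshold."""
    if not canon_hits:
        return True
    return max(hit["score"] for hit in canon_hits) < CANON_MIN_SCORE


print(classify("Anthropic Haiku 4.5 price"))   # _canon
print(classify("RAG pattern"))                 # _shared
print(needs_personal_supplement([{"score": 0.5}]))  # True
```

The two example queries are the ones from the playbook excerpt; a lookup like this costs a handful of set operations per query.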

Crawler internals

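# fukuro/crawler.py: main loop
import asyncio
from urllib.parse import urlparse

import httpx

# guard_url, last_seen, parser, blob_store, ingest_canon, CrawlReport and Rule
# come from the rest of the fukuro package (not shown here).

async def crawl_once(allowlist: list[Rule]) -> CrawlReport:
    report = CrawlReport()
    async with httpx.AsyncClient(
        headers={"User-Agent": "fukuro-crawler/0.1 (+marc personal)"},
        timeout=30.0,
        limits=httpx.Limits(max_connections=4),
    ) as client:
        for rule in allowlist:
            urls = await rule.fetcher.discover(client)  # sitemap or doc index
            for url in urls:
                guard_url(url, allowlist)                # fail fast
                # Assumes last_seen() returns None for a URL never fetched before;
                # the conditional header is only sent when a prior timestamp exists.
                since = last_seen(url)
                headers = {"If-Modified-Since": since} if since else {}
                try:
                    resp = await client.get(url, headers=headers)
                    if resp.status_code == 304:
                        report.skipped_304 += 1
                        continue
                    if resp.status_code == 404:
                        report.errors.append(("404", url))
                        continue
                    resp.raise_for_status()              # never ingest other non-2xx bodies
                    content = parser.extract_main(resp.text)
                    await blob_store.put(
                        f"s3://canon/{rule.host}{urlparse(url).path}.html",
                        resp.content
                    )
                    await ingest_canon(url=url, content=content, rule=rule)
                    report.ingested += 1
                except httpx.HTTPError as e:
                    report.errors.append((str(e), url))
                finally:
                    # throttle 1 req/s/host; in `finally` so 304/404 `continue` paths also wait
                    await asyncio.sleep(1.0)
    return report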
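
The If-Modified-Since header in the loop above needs an HTTP-date string. The body of last_seen() is not shown, so the following is only a sketch of how a stored fetch timestamp could be turned into that header; conditional_headers is a hypothetical helper name, and it assumes timestamps are stored as timezone-aware datetimes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def conditional_headers(last_fetch: Optional[datetime]) -> dict[str, str]:
    """Build conditional-request headers from the previous fetch time.

    Returns an empty dict for a URL that has never been fetched, so the
    first request is unconditional.
    """
    if last_fetch is None:
        return {}
    # format_datetime(..., usegmt=True) requires a UTC-aware datetime and
    # emits the HTTP-date form, e.g. "Wed, 15 Jan 2025 08:30:00 GMT".
    as_utc = last_fetch.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


ts = datetime(2025, 1, 15, 8, 30, 0, tzinfo=timezone.utc)
print(conditional_headers(ts))
print(conditional_headers(None))
```

A server that honors the header answers 304 with no body, which is what lets the crawler skip parsing and embedding for unchanged pages.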

Mode A — `fukuro-audit` Skill

# ~/.claude/skills/fukuro-audit/SKILL.md
---
name: fukuro-audit
description: |
  Audit AI infrastructure: ideation or production. Trigger:
  "fukuro audit ideation: <idea>" → JTBD/prior-art/scope/ROI/deps/alts rubric
  "fukuro audit production: <project-slug>" → 6 categories code scan + Grok 4.3 judge
  Outputs ADHD-friendly digest: health score + collapsible P0/P1/P2 findings.
---

Audit branches:

Ideation: rubric covering JTBD clarity, prior-art collision (via _canon search), scope realism, ROI estimate, dependency risks, alternatives considered.
Production: scans the project repo for AI config (model IDs, prompt files, API params) and cross-checks it against _canon for drift (e.g. deprecated model, price change, new better option); Grok 4.3 then grades each finding's severity as P0/P1/P2.

Both branches cite _canon URLs so the user can verify every claim.
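
As an illustration of the production branch's drift check, the sketch below scans source text for model IDs and flags any that a canon-derived deprecation map lists, keeping the _canon URL as the citation. The regex, the DriftFinding type, the function name, and the shape of the deprecation map are all assumptions; the skill's real scanner and the Grok severity step are not shown in this doc.

```python
import re
from dataclasses import dataclass

# Matches e.g.  model="some-id"  or  "model": "some-id"  (assumed config style).
MODEL_ID_RE = re.compile(r"""\bmodel["']?\s*[:=]\s*["']([\w.\-]+)["']""")


@dataclass(frozen=True)
class DriftFinding:
    model_id: str
    line_no: int
    canon_url: str  # citation the user can open to verify the claim


def find_deprecated_models(source: str, deprecated: dict[str, str]) -> list[DriftFinding]:
    """Flag model IDs in `source` that appear in `deprecated` (model_id -> canon URL)."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for model_id in MODEL_ID_RE.findall(line):
            if model_id in deprecated:
                findings.append(DriftFinding(model_id, line_no, deprecated[model_id]))
    return findings


# Placeholder IDs and URL, not real vendor data.
code = 'client.create(model="example-old-1")\ncfg = {"model": "example-new-2"}\n'
canon = {"example-old-1": "https://docs.example.com/deprecations"}
for finding in find_deprecated_models(code, canon):
    print(finding)
```

Detection stays deterministic here; only the severity grading would need the judge model.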

Performance numbers

Measured on MacBook Pro M2 Max (shared with tako daemon):

Operation	Number	Note
Daily crawl wall time	~6 min	~180 URLs, 1 req/s throttle, async I/O
Crawl HTTP rate	1 req/s/host	well under vendor rate limits
bge-m3 embed (shared)	0.39 chunks/s on CPU	already sunk cost; reused
`_canon` chunks inserted	~3,400	initial seed crawl
`_canon` storage	28 MB	small vs `_personal` (~3.2 GB)
Routing decision overhead	+12 ms p50	intent-classifier in orchestrator
`kb_search_canon` p50	~820 ms	comparable to `kb_search_personal` (shared store)
Crawler RAM	~80 MB	when running
Crawler CPU	<5% on M2 Max	during 6-min crawl

Reliability features

Feature	How
Idempotent re-crawl	SHA-256 hash check; skip when content unchanged
Delta-aware fetching	`If-Modified-Since` header; 304 short-circuits embed
Allowlist guard	Fires on every URL + every workspace write attempt
Raw HTML archive	MinIO mount preserves audit trail of crawled state
Auto-restart	launchd `KeepAlive` on crash
404 logging	Per-URL counter; surfaces vendor URL rot
Throttle	1 req/s per host enforced via `asyncio.sleep`
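
The idempotent re-crawl row above can be sketched as a digest comparison. This is a minimal illustration, assuming the previous digest per URL is available as a mapping; the in-memory dict stands in for wherever the crawler actually persists hashes, and the function names are hypothetical.

```python
import hashlib


def content_digest(content: str) -> str:
    """SHA-256 of the extracted page text, as a hex string."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def should_ingest(url: str, content: str, seen: dict[str, str]) -> bool:
    """Return True (and record the new digest) only when content changed."""
    digest = content_digest(content)
    if seen.get(url) == digest:
        return False  # unchanged since last crawl: skip chunking and embedding
    seen[url] = digest
    return True


seen: dict[str, str] = {}
print(should_ingest("https://docs.x.ai/a", "v1", seen))  # True
print(should_ingest("https://docs.x.ai/a", "v1", seen))  # False
print(should_ingest("https://docs.x.ai/a", "v2", seen))  # True
```

This complements the 304 path: a server that ignores If-Modified-Since still costs one download, but not a re-embed.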

Security & integrity model

Concern	Mitigation
Allowlist tampering	`allowlist.yml` checked into git; daemon refuses to start if hash mismatch
Workspace write bypass	Server-side `guard_workspace_write()` rejects non-crawler callers
Filesystem ingest leak	Mount-watcher disabled for `_canon/` path
Manual MCP `kb_ingest` to `_canon`	Server returns `403 workspace_protected`
Vendor doc poisoning	Out of scope (trust the vendor); if vendor doc is wrong, that’s their bug
Crawler crash leaking partial data	Transactional upsert: chunks committed only after full source row
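
The allowlist tampering row above describes a startup check against a known hash. A minimal sketch, assuming the expected SHA-256 is pinned somewhere outside the file itself (how the daemon actually obtains it is not stated in this doc); AllowlistTampered, file_sha256, and verify_allowlist are hypothetical names.

```python
import hashlib
import hmac
from pathlib import Path


class AllowlistTampered(RuntimeError):
    """Raised when allowlist.yml does not match the pinned digest."""


def file_sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_allowlist(path: Path, expected_sha256: str) -> None:
    """Raise before the crawler starts if the file differs from the pinned hash."""
    actual = file_sha256(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise AllowlistTampered(f"{path}: expected {expected_sha256}, got {actual}")
```

Calling verify_allowlist() as the first step of the daemon entrypoint would turn any uncommitted edit to allowlist.yml into a refusal to start rather than a silent scope change.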

Cost

Item	Cost
Compute (M2 Max shared)	$0
Postgres + pgvector (local)	$0
MinIO (local)	$0
LLM in hot path	$0 (deterministic crawler; no LLM call)
LLM in audit (Mode A, optional)	~$0 per invocation (Grok 4.3 judge, sparse)
Total	~$0/month

Reproducibility — for a forker

# Prereqs: Personal-RAG (tako) already running with Postgres + pgvector + MinIO
cd ~/Documents/Side.Projects/tako/server
git pull                                          # tako v0.6.0-s3 or later

# 1. Register _canon workspace
psql ragkb -c "INSERT INTO workspaces (name, writable_by)
               VALUES ('_canon', 'fukuro-crawler');"

# 2. Drop in fukuro/ folder, edit allowlist.yml to your taste
cp -r fukuro/ ~/Documents/Side.Projects/tako/server/

# 3. Install Claude Skill
cp -r skills/fukuro-audit ~/.claude/skills/

# 4. Install launchd timer
cp launchd/ai.vuihoc.fukuro-crawl.daily.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.vuihoc.fukuro-crawl.daily.plist

# 5. First crawl (manual)
python -m fukuro.crawler --once

Total: ~30 min if Personal-RAG is already running.

Future work

Add pm_canon workspace (Marty Cagan, Lenny, Reforge canon) — same daemon shape, different allowlist
Add design_canon (Linear, Notion, Apple HIG)
404-driven auto-deindex (when vendor removes a page)
Re-evaluate Mode B after 4 weeks of Mode A usage data
Selector-drift detector — auto-alert when parsed content shrinks >50% vs prior crawl
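
The selector-drift item in the list above reduces to a relative-size comparison between crawls. A possible sketch, assuming the parsed content length of the prior crawl is stored per URL; the function name and the choice of character count as the size measure are illustrative.

```python
def content_shrank(prev_len: int, new_len: int, threshold: float = 0.5) -> bool:
    """True when parsed content shrank by more than `threshold` vs the prior crawl."""
    if prev_len <= 0:
        return False  # no baseline yet, nothing to compare against
    shrink_ratio = (prev_len - new_len) / prev_len
    return shrink_ratio > threshold


print(content_shrank(1000, 400))  # True: 60% smaller
print(content_shrank(1000, 500))  # False: exactly 50% is not ">50%"
```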

License & attribution

Personal project. Built on:

Personal-RAG / tako — workspace foundation
bge-m3 — shared embedder
Claude Skill SDK — Mode A

AI-Canon-Crawler — Implementation