Architecture

Sister docs: PRD (intent), Implementation (deep-dive), Notes (decision log).

System view

flowchart TB
    classDef client fill:#cce0e8,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef mode fill:#e0d5ed,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef server fill:#faedd6,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef store fill:#f4d6db,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px

    subgraph Clients["👤 Claude clients"]
        Desktop["Claude Desktop / Web / iOS"]
    end

    Desktop --> Orch

    subgraph Orch["🧭 MCP Orchestrator (tako server)"]
        Router["Routing rule
spec/price/version → canon first"]
        ToolCanon["kb_search_canon"]
        ToolPers["kb_search_personal"]
        ToolShared["kb_search_shared"]
        Router --> ToolCanon
        Router --> ToolPers
        Router --> ToolShared
    end

    subgraph ModeA["📘 Mode A — fukuro-audit Skill"]
        Audit["Claude Skill
ideation + production branches"]
    end

    subgraph ModeC["🤖 Mode C — Crawler daemon"]
        Timer["launchd daily timer"]
        Fetch["httpx + BeautifulSoup"]
        Allow["Allowlist guard
(only writer to _canon)"]
        Embed["bge-m3 (shared)"]
        Timer --> Fetch
        Fetch --> Allow
        Allow --> Embed
    end

    Audit -.invokes.-> ToolCanon

    subgraph DB["🗄️ Postgres 16 + pgvector"]
        Canon["_canon workspace
~180 docs · 3,400 chunks · 28MB"]
        Personal["_personal · _shared · ..."]
    end

    subgraph S3["📦 MinIO S3 (BlobStore mount)"]
        RawHTML["Raw HTML archive
s3://canon/<host>/<path>.html"]
    end

    Embed --> Canon
    Fetch --> RawHTML
    ToolCanon --> Canon
    ToolPers --> Personal
    ToolShared --> Personal

    class Desktop client
    class Audit,Timer,Fetch,Allow,Embed mode
    class Router,ToolCanon,ToolPers,ToolShared server
    class Canon,Personal,RawHTML store

Two-mode design (post scope-cut)

Mode	What it is	When it runs	Trigger
Mode A	`fukuro-audit` Claude Skill — audits ideation or production projects against canon facts	On user invocation	User types skill name in Claude
Mode B ❌	~~Continuous evidence gathering during conversation~~	—	Dropped — 1 data point insufficient; overlap with `kb-audit` weekly job
Mode C	Crawler daemon — only writer to `_canon`	Daily, ~6 min wall	launchd timer

Mode A reads _canon through kb_search_canon and never writes to it. Mode C is the sole writer to _canon and takes no part in query-time reads.

This separation is the integrity contract: the workspace cannot be corrupted by manual additions.

Data flow — Crawl (Mode C)

[Daily launchd timer fires]
        │
        ▼
[Crawler loads allowlist.yml]
    - docs.anthropic.com/*
    - docs.x.ai/*
    - huggingface.co/<org>/<model>
    - arXiv PDFs cited in _personal (one-shot)
        │
        ▼
[For each allowlist entry — async httpx]
    GET <url> with custom User-Agent + If-Modified-Since header
        │
        ├─ 304 Not Modified  →  skip (delta-aware)
        ├─ 404 Not Found     →  log + flag for human review
        └─ 200 OK            →  proceed
        │
        ▼
[Allowlist guard (final check before write)]
    assert url matches allowlist pattern
    assert host in {docs.anthropic.com, docs.x.ai, huggingface.co, arxiv.org}
    else: raise IntegrityError (halts the crawl, writes an alert to the log)
        │
        ▼
[Parse — BeautifulSoup]
    extract main content, strip nav/footer/ads
    derive title, canonical_url, last_modified
        │
        ▼
[Archive raw HTML]
    PUT s3://canon/<host>/<path>.html (MinIO mount)
        │
        ▼
[Hash check (shared tako logic)]
    SHA256(content) → compare kb_sources.content_hash
        ├─ same   → skip embed, mark crawled_at
        └─ differ → DELETE old chunks, re-embed
        │
        ▼
[Embed via shared bge-m3]
    chunk_text() → list[str]
    encode(["passage: " + c]) → vectors
        │
        ▼
[Upsert into Postgres _canon workspace]
    INSERT/UPDATE kb_sources (workspace='_canon', source_type='vendor_doc')
    INSERT kb_chunks bulk
    COMMIT
        │
        ▼
[Log to ~/.claude/hooks/fukuro-crawl.log]
    {crawled: N, skipped_304: M, updated: K, errors: [...]}
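
The status handling and the allowlist guard in the flow above can be sketched with the Python standard library. This is an illustration under assumptions, not the crawler's actual code: the names `ALLOWED_HOSTS`, `ALLOWLIST_PATTERNS`, `classify_response`, and `guard` are hypothetical, the glob patterns stand in for whatever syntax `allowlist.yml` really uses, `IntegrityError` is defined locally, and treating every other status code as an error is a guess.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hosts taken from the guard step in the flow above.
ALLOWED_HOSTS = {"docs.anthropic.com", "docs.x.ai", "huggingface.co", "arxiv.org"}

# Assumed glob form of two allowlist entries; the real file format is not shown.
ALLOWLIST_PATTERNS = ["docs.anthropic.com/*", "docs.x.ai/*"]


class IntegrityError(Exception):
    """Raised when a URL outside the allowlist reaches the write path."""


def classify_response(status: int) -> str:
    """Map an HTTP status code to the crawl action from the flow above."""
    if status == 304:
        return "skip"      # not modified since the last crawl
    if status == 404:
        return "flag"      # log and flag for human review
    if status == 200:
        return "proceed"
    return "error"         # assumption: anything else is recorded as an error


def guard(url: str) -> None:
    """Final check before any write: host and pattern must both match."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    target = host + parts.path
    if host not in ALLOWED_HOSTS:
        raise IntegrityError(f"host not allowed: {host!r}")
    if not any(fnmatch(target, pattern) for pattern in ALLOWLIST_PATTERNS):
        raise IntegrityError(f"no allowlist pattern matches: {target!r}")
```

Checking the parsed hostname rather than searching the URL string means a URL such as `https://evil.example.com/docs.anthropic.com/x` is rejected even though it contains an allowed host name in its path.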

Data flow — Query (routing)

[1] User → Claude:
    "What's the current Haiku 4.5 input price?"

           │
           ▼
[2] MCP orchestrator routing rule (server-side playbook):
    intent_classifier(query) → matches {spec, price, version, context_window,
                                         API_parameter, model_card}
           │
           ▼
[3] Route decision:
    if intent ∈ canon_intents → call kb_search_canon FIRST
    else                       → call kb_search_personal / kb_search_shared

           │
           ▼
[4] kb_search_canon executes:
    qvec = bge-m3.encode(["query: " + user_query])[0]
    SELECT ... FROM kb_chunks JOIN kb_sources
     WHERE workspace = '_canon'
     ORDER BY vector_distance(...) LIMIT top_k

           │
           ▼
[5] Fallback rule:
    if results empty OR top score < 0.65:
        ALSO call kb_search_personal as supplement

           │
           ▼
[6] Return chunks → Claude synthesizes with vendor URL citation
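
Steps [2] to [5] above amount to a small decision function. The sketch below is one possible reading, with several assumptions: `is_canon_intent` is a crude keyword stand-in for `intent_classifier` (whose real logic is not shown), the search tools are passed in as plain callables, a higher score is taken to mean a closer match, and the non-canon branch is collapsed to a single personal search for brevity.

```python
from typing import Callable

Hit = tuple[str, float]          # (chunk text, similarity score)
Search = Callable[[str], list[Hit]]

# Hypothetical keyword list; substring matching is deliberately naive.
CANON_KEYWORDS = ("spec", "price", "version", "context window", "parameter", "model card")

FALLBACK_THRESHOLD = 0.65        # from the fallback rule above


def is_canon_intent(query: str) -> bool:
    """Return True when the query looks like a spec/price/version question."""
    lowered = query.lower()
    return any(keyword in lowered for keyword in CANON_KEYWORDS)


def route(query: str, search_canon: Search, search_personal: Search) -> list[Hit]:
    """Canon first for canon intents, with personal results as a supplement."""
    if not is_canon_intent(query):
        return search_personal(query)
    hits = search_canon(query)
    top_score = max((score for _, score in hits), default=0.0)
    if not hits or top_score < FALLBACK_THRESHOLD:
        # Fallback rule: keep the canon hits and append personal ones.
        hits = hits + search_personal(query)
    return hits
```

Because the decision lives in one server-side function, every client gets the same canon-first behaviour regardless of how its prompt is worded.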

Data flow — Audit (Mode A)

[User invokes fukuro-audit skill in Claude]
        │
        ▼
[Skill parses args]
    "fukuro audit ideation: <idea>"   → JTBD/prior-art/scope/ROI rubric
    "fukuro audit production: <slug>" → 6-category code scan + judge

        │
        ▼
[For ideation branch]
    - Call kb_search_canon for vendor facts referenced in idea
    - Cross-check claims (e.g. "uses Claude 4.7" → verify model exists)
    - Output rubric with citations

[For production branch]
    - Scan project repo for AI infra files (config, prompts, model IDs)
    - Cross-check against _canon (model deprecation, price drift, API changes)
    - Grok 4.3 judges severity → P0/P1/P2 findings

        │
        ▼
[ADHD-friendly digest returned to user]
    - Health score (0-100)
    - Collapsible P0/P1/P2 sections
    - Each finding cites _canon source URL
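
The argument parsing and the severity sections in the flow above could look roughly like this. It is a sketch only: `parse_audit_command` and `group_findings` are hypothetical names, and the finding shape (a dict with a `severity` key) is an assumption. The health score is left out because its formula is not given here.

```python
from collections import defaultdict

BRANCHES = {"ideation", "production"}
SEVERITIES = ("P0", "P1", "P2")


def parse_audit_command(text: str) -> tuple[str, str]:
    """Split 'fukuro audit <branch>: <argument>' into (branch, argument)."""
    prefix = "fukuro audit "
    if not text.startswith(prefix):
        raise ValueError("not a fukuro audit command")
    branch, separator, argument = text[len(prefix):].partition(":")
    branch, argument = branch.strip(), argument.strip()
    if not separator or branch not in BRANCHES or not argument:
        raise ValueError(f"expected 'fukuro audit ideation|production: <arg>', got {text!r}")
    return branch, argument


def group_findings(findings: list[dict]) -> dict[str, list[dict]]:
    """Bucket findings by severity, in P0, P1, P2 order, for the digest sections."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for finding in findings:
        buckets[finding["severity"]].append(finding)
    # Always emit all three sections, even when one is empty.
    return {severity: buckets[severity] for severity in SEVERITIES}
```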

Component responsibilities

Component	Owns	Doesn’t own
Crawler daemon (Mode C)	Allowlist enforcement, fetching, parsing, archiving, upserting `_canon`	Routing, query-time logic
Allowlist guard	Final integrity check before any `_canon` write	Source discovery
`fukuro-audit` skill (Mode A)	Audit rubric, severity classification, digest formatting	Crawling, writing to `_canon`
MCP orchestrator	Routing decision (canon-first for spec/price intents), fallback rule	Crawling, source curation
`_canon` workspace	Vendor-truth storage, overwrite-on-recrawl	Personal notes, curated cheatsheets
Shared bge-m3	Embedding both passages and queries	Storage, retrieval
MinIO BlobStore	Raw HTML archive (audit trail of what was crawled)	Live serving
launchd	Daily timer, restart on crash	Crawl logic

Workspace boundary — why this matters

The _canon workspace has one writer (the crawler) and one input source (the allowlist). Every other path is rejected:

Attempted write path	Result
Manual `kb_ingest` MCP call with `workspace='_canon'`	Rejected at server: only crawler service account can write `_canon`
File written to KB mount under `_canon/` folder	Rejected at mount-watcher: `_canon` ingest disabled for filesystem path
Crawler fetches a URL not in allowlist	Rejected at allowlist guard: raises `IntegrityError`, crawl halts
Personal note tagged `vendor_doc`	Stored in `_personal` workspace; tag ignored for routing

This is the integrity contract. Once any leak path opens, the workspace’s value collapses within weeks (the original problem).
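
The first and last rows of the table can be read as two small server-side rules. The sketch below shows one way to express them. All names are hypothetical, including the `"crawler"` principal that stands in for the crawler service account, and the real check may live somewhere else entirely.

```python
CANON = "_canon"
PERSONAL = "_personal"
CRAWLER_PRINCIPAL = "crawler"  # hypothetical name for the crawler service account


class WriteRejected(Exception):
    """Raised when a caller other than the crawler targets _canon."""


def authorize_write(workspace: str, principal: str) -> None:
    """Allow any write except a non-crawler write to _canon."""
    if workspace == CANON and principal != CRAWLER_PRINCIPAL:
        raise WriteRejected(f"{principal!r} may not write to {CANON}")


def workspace_for_note(tags: set[str]) -> str:
    """Personal notes stay in _personal; a vendor_doc tag does not reroute them."""
    return PERSONAL
```

Keeping the rule keyed on the caller's identity rather than on document metadata is what stops a tag or a folder name from becoming a write path into `_canon`.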

Failure modes & recovery

Failure	Detect	Recovery	Time
Vendor URL 404 (page moved)	Crawler logs + count metric	Manual allowlist patch	minutes
Vendor changes HTML structure (parser breaks)	Empty content extracted, hash diff anomaly	Update BeautifulSoup selectors	<30 min
Crawl exceeds rate limit (HTTP 429)	httpx logs	Throttle backoff (already implemented at 1 req/s)	next run
MinIO mount unavailable	BlobStore write fails	Crawler retries next tick; Postgres upsert still proceeds	next run
Postgres down	All ingests fail	Restart tako daemon	<1 min
`_canon` data corruption	Audit script diffs hash on rotation	Re-crawl from allowlist (idempotent)	~6 min
Allowlist file deleted	Crawler fails fast	Restore from git	<1 min
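
Recovery by re-crawl depends on the upsert being idempotent, which the SHA-256 comparison in the crawl flow provides. A minimal sketch of that property follows, with in-memory dicts standing in for `kb_sources` and `kb_chunks`. The function names and the paragraph-based chunker are assumptions, not the shared tako logic.

```python
import hashlib
from typing import Callable


def content_hash(content: str) -> str:
    """SHA-256 of the extracted content, as in the hash-check step."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def split_paragraphs(content: str) -> list[str]:
    """Toy chunker standing in for chunk_text()."""
    return [part.strip() for part in content.split("\n\n") if part.strip()]


def upsert(
    sources: dict[str, str],
    chunks: dict[str, list[str]],
    url: str,
    content: str,
    chunker: Callable[[str], list[str]] = split_paragraphs,
) -> str:
    """Store content for url and report 'unchanged' or 'updated'."""
    digest = content_hash(content)
    if sources.get(url) == digest:
        return "unchanged"           # same hash: skip chunking and embedding
    chunks[url] = chunker(content)   # replaces the old chunks wholesale
    sources[url] = digest
    return "updated"
```

Running `upsert` twice with the same content leaves both dicts untouched on the second call, so repeating a crawl cannot duplicate chunks.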

Why these choices

Decision	Alternative considered	Why this won
Separate workspace, not a tag	`source_type=vendor_doc` + re-rank boost	The routing decision happens before retrieval, which is stronger than a post-hoc boost. The classifier-tag flow missed ~15% and was fragile.
Crawler-only writer	Allow manual `_canon` ingest with vendor URL	One leaky source contaminates the workspace within weeks; the constraint is the product.
Allowlist over auto-discovery	Crawl any URL the user mentions	Auto-discovery has no integrity boundary; the allowlist is the canon’s definition.
Daily cadence	Realtime crawl on conversation	Canon-first routing adds about 12 ms per query, whereas a realtime crawl would add minutes; a daily crawl is enough because vendor docs change weekly at most.
Routing in orchestrator, not prompt	Tell Claude “prefer canon when spec question”	Prompts drift; routing rules don’t. Server-side enforcement = consistent across all clients.
Shared bge-m3	Dedicated canon embedder	Same query embedder must be used for both workspaces to make scores comparable in fallback.
Drop Mode B	Build continuous evidence gathering	1 data point of demand; overlap with weekly `kb-audit`. Re-evaluate after Mode A/C usage.
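
The shared-embedder decision can be illustrated with a thin wrapper that sends passages and queries through one encoder, using the `passage: ` and `query: ` prefixes from the data flows. This is a sketch under assumptions: `SharedEmbedder` is a hypothetical name, and the encoder is injected as a plain callable rather than a real bge-m3 model.

```python
from typing import Callable, Sequence

Vector = list[float]
Encoder = Callable[[Sequence[str]], list[Vector]]


class SharedEmbedder:
    """Wraps one encoder so passages and queries share a vector space."""

    def __init__(self, encode: Encoder) -> None:
        self._encode = encode

    def embed_passages(self, chunks: Sequence[str]) -> list[Vector]:
        """Embed document chunks at crawl or ingest time."""
        return self._encode(["passage: " + chunk for chunk in chunks])

    def embed_query(self, query: str) -> Vector:
        """Embed a single user query at search time."""
        return self._encode(["query: " + query])[0]
```

Passing the same `SharedEmbedder` instance to both the canon and personal search paths keeps their scores on one scale, which is what the 0.65 fallback comparison relies on.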

AI-Canon-Crawler — Architecture