● Shipped P1 Size S Foundation

AI-Canon-Crawler — Architecture

Two-mode design, crawl pipeline, routing decision, workspace boundary.


Sister docs: PRD (intent), Implementation (deep-dive), Notes (decision log).

System view

flowchart TB
    classDef client fill:#cce0e8,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef mode fill:#e0d5ed,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef server fill:#faedd6,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
    classDef store fill:#f4d6db,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px

    subgraph Clients["👤 Claude clients"]
        Desktop["Claude Desktop / Web / iOS"]
    end

    Desktop --> Orch

    subgraph Orch["🧭 MCP Orchestrator (tako server)"]
        Router["Routing rule<br/>spec/price/version → canon first"]
        ToolCanon["kb_search_canon"]
        ToolPers["kb_search_personal"]
        ToolShared["kb_search_shared"]
        Router --> ToolCanon
        Router --> ToolPers
        Router --> ToolShared
    end

    subgraph ModeA["📘 Mode A: fukuro-audit Skill"]
        Audit["Claude Skill<br/>ideation + production branches"]
    end

    subgraph ModeC["🤖 Mode C: Crawler daemon"]
        Timer["launchd daily timer"]
        Fetch["httpx + BeautifulSoup"]
        Allow["Allowlist guard<br/>(only writer to _canon)"]
        Embed["bge-m3 (shared)"]
        Timer --> Fetch
        Fetch --> Allow
        Allow --> Embed
    end

    Audit -.invokes.-> ToolCanon

    subgraph DB["🗄️ Postgres 16 + pgvector"]
        Canon["_canon workspace<br/>~180 docs · 3,400 chunks · 28MB"]
        Personal["_personal · _shared · ..."]
    end

    subgraph S3["📦 MinIO S3 (BlobStore mount)"]
        RawHTML["Raw HTML archive<br/>s3://canon//.html"]
    end

    Embed --> Canon
    Fetch --> RawHTML
    ToolCanon --> Canon
    ToolPers --> Personal
    ToolShared --> Personal

    class Desktop client
    class Audit,Timer,Fetch,Allow,Embed mode
    class Router,ToolCanon,ToolPers,ToolShared server
    class Canon,Personal,RawHTML store

Two-mode design (post scope-cut)

| Mode | What it is | When it runs | Trigger |
|---|---|---|---|
| Mode A | fukuro-audit Claude Skill: audits ideation or production projects against canon facts | On user invocation | User types skill name in Claude |
| Mode B | Continuous evidence gathering during conversation | Dropped: 1 data point insufficient; overlap with kb-audit weekly job | |
| Mode C | Crawler daemon: only writer to _canon | Daily, ~6 min wall | launchd timer |

Mode A reads _canon via kb_search_canon; never writes. Mode C writes _canon exclusively; never reads at query time.

This separation is the integrity contract: the workspace cannot be corrupted by manual additions.

Data flow — Crawl (Mode C)

[Daily launchd timer fires]


[Crawler loads allowlist.yml]
    - docs.anthropic.com/*
    - docs.x.ai/*
    - huggingface.co/<org>/<model>
    - arXiv PDFs cited in _personal (one-shot)


[For each allowlist entry — async httpx]
    GET <url> with custom User-Agent + If-Modified-Since header

        ├─ 304 Not Modified  →  skip (delta-aware)
        ├─ 404 Not Found     →  log + flag for human review
        └─ 200 OK            →  proceed


[Allowlist guard (final check before write)]
    assert url matches allowlist pattern
    assert host in {docs.anthropic.com, docs.x.ai, huggingface.co, arxiv.org}
    else: raise IntegrityError (kills the crawl, alerts log)


[Parse — BeautifulSoup]
    extract main content, strip nav/footer/ads
    derive title, canonical_url, last_modified


[Archive raw HTML]
    PUT s3://canon/<host>/<path>.html (MinIO mount)


[Hash check (shared tako logic)]
    SHA256(content) → compare kb_sources.content_hash
        ├─ same   → skip embed, mark crawled_at
        └─ differ → DELETE old chunks, re-embed


[Embed via shared bge-m3]
    chunk_text() → list[str]
    encode(["passage: " + c]) → vectors


[Upsert into Postgres _canon workspace]
    INSERT/UPDATE kb_sources (workspace='_canon', source_type='vendor_doc')
    INSERT kb_chunks bulk
    COMMIT


[Log to ~/.claude/hooks/fukuro-crawl.log]
    {crawled: N, skipped_304: M, updated: K, errors: [...]}
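
The allowlist guard and the hash check in the flow above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's code: the `IntegrityError` class body, the helper names `assert_allowed`, `content_hash`, and `needs_reembed`, the concrete glob patterns (including `arxiv.org/pdf/*`), and the HTTPS-only requirement are all hypothetical.

```python
import hashlib
from fnmatch import fnmatchcase
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical stand-ins for the contents of allowlist.yml.
ALLOWED_HOSTS = {"docs.anthropic.com", "docs.x.ai", "huggingface.co", "arxiv.org"}
ALLOWED_PATTERNS = [
    "docs.anthropic.com/*",
    "docs.x.ai/*",
    "huggingface.co/*/*",  # stands in for huggingface.co/<org>/<model>
    "arxiv.org/pdf/*",     # assumed shape of the arXiv PDF entries
]


class IntegrityError(Exception):
    """Raised when a URL outside the allowlist reaches the write path."""


def assert_allowed(url: str) -> None:
    """Final check before any _canon write: host and pattern must both match."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: only HTTPS sources are accepted.
    if parts.scheme != "https" or host not in ALLOWED_HOSTS:
        raise IntegrityError(f"host not allowlisted: {url}")
    target = host + parts.path
    if not any(fnmatchcase(target, pattern) for pattern in ALLOWED_PATTERNS):
        raise IntegrityError(f"path not allowlisted: {url}")


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the extracted page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_reembed(content: str, stored_hash: Optional[str]) -> bool:
    """True when the page is new or differs from kb_sources.content_hash."""
    return stored_hash is None or content_hash(content) != stored_hash
```

Matching on the parsed hostname rather than on the raw URL string means a URL such as `https://evil.example.com/docs.anthropic.com/x` fails the host check even though it contains an allowlisted name.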

Data flow — Query (routing)

[1] User → Claude:
    "What's the current Haiku 4.5 input price?"



[2] MCP orchestrator routing rule (server-side playbook):
    intent_classifier(query) → matches {spec, price, version, context_window,
                                         API_parameter, model_card}


[3] Route decision:
    if intent ∈ canon_intents → call kb_search_canon FIRST
    else                       → call kb_search_personal / kb_search_shared



[4] kb_search_canon executes:
    qvec = bge-m3.encode(["query: " + user_query])[0]
    SELECT ... FROM kb_chunks JOIN kb_sources
     WHERE workspace = '_canon'
     ORDER BY vector_distance(...) LIMIT top_k



[5] Fallback rule:
    if results empty OR top score < 0.65:
        ALSO call kb_search_personal as supplement



[6] Return chunks → Claude synthesizes with vendor URL citation
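
The route decision and fallback rule above can be sketched as a small Python function. This is an illustration only: the keyword map is a hypothetical stand-in for `intent_classifier`, the `Chunk` type and the `route` signature are assumptions, scores are assumed to be similarities where higher is better, and the personal and shared searches are collapsed into a single `search_other` callable for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical keyword map; the real intent_classifier is not shown in this doc.
CANON_INTENT_KEYWORDS = {
    "spec": ("spec", "specification"),
    "price": ("price", "pricing", "cost"),
    "version": ("version", "deprecated"),
    "context_window": ("context window",),
    "API_parameter": ("parameter",),
    "model_card": ("model card",),
}
FALLBACK_THRESHOLD = 0.65  # from the fallback rule in step [5]


@dataclass
class Chunk:
    text: str
    source_url: str
    score: float  # assumed: higher means more similar


SearchFn = Callable[[str], List[Chunk]]


def is_canon_intent(query: str) -> bool:
    """Naive substring match standing in for intent_classifier(query)."""
    q = query.lower()
    return any(kw in q for kws in CANON_INTENT_KEYWORDS.values() for kw in kws)


def route(query: str, search_canon: SearchFn, search_other: SearchFn) -> List[Chunk]:
    """Canon first for canon intents; supplement when canon is empty or weak."""
    if not is_canon_intent(query):
        return search_other(query)
    results = search_canon(query)
    top_score = max((c.score for c in results), default=0.0)
    if not results or top_score < FALLBACK_THRESHOLD:
        # Fallback supplements canon results; it does not replace them.
        results = results + search_other(query)
    return results
```

Passing the search functions in as callables keeps the rule testable without a database: stub functions returning fixed `Chunk` lists are enough to exercise the canon-first path, the weak-score fallback, and the non-canon path.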

Data flow — Audit (Mode A)

[User invokes fukuro-audit skill in Claude]


[Skill parses args]
    "fukuro audit ideation: <idea>"   → JTBD/prior-art/scope/ROI rubric
    "fukuro audit production: <slug>" → 6-category code scan + judge



[For ideation branch]
    - Call kb_search_canon for vendor facts referenced in idea
    - Cross-check claims (e.g. "uses Claude 4.7" → verify model exists)
    - Output rubric with citations

[For production branch]
    - Scan project repo for AI infra files (config, prompts, model IDs)
    - Cross-check against _canon (model deprecation, price drift, API changes)
    - Grok 4.3 judges severity → P0/P1/P2 findings



[ADHD-friendly digest returned to user]
    - Health score (0-100)
    - Collapsible P0/P1/P2 sections
    - Each finding cites _canon source URL
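
As a rough sketch of the digest structure above, the following Python groups findings by severity and refuses any finding that lacks a `_canon` citation. The `Finding` type, its field names, and the plain-text rendering are assumptions made for illustration; the health score is left out because its formula is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

SEVERITIES = ("P0", "P1", "P2")


@dataclass(frozen=True)
class Finding:
    severity: str
    summary: str
    canon_url: str  # source URL of the _canon chunk backing the finding

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if not self.canon_url:
            raise ValueError("every finding must cite a _canon source URL")


def render_digest(findings: List[Finding]) -> str:
    """Render one section per severity, most severe first."""
    grouped: Dict[str, List[Finding]] = defaultdict(list)
    for finding in findings:
        grouped[finding.severity].append(finding)
    lines: List[str] = []
    for severity in SEVERITIES:
        items = grouped.get(severity, [])
        lines.append(f"{severity} ({len(items)})")
        for item in items:
            lines.append(f"  - {item.summary} [{item.canon_url}]")
    return "\n".join(lines)
```

Enforcing the citation in the constructor means an uncited finding cannot reach the digest at all, which mirrors the rule that each finding cites its `_canon` source.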

Component responsibilities

| Component | Owns | Doesn’t own |
|---|---|---|
| Crawler daemon (Mode C) | Allowlist enforcement, fetching, parsing, archiving, upserting _canon | Routing, query-time logic |
| Allowlist guard | Final integrity check before any _canon write | Source discovery |
| fukuro-audit skill (Mode A) | Audit rubric, severity classification, digest formatting | Crawling, writing to _canon |
| MCP orchestrator | Routing decision (canon-first for spec/price intents), fallback rule | Crawling, source curation |
| _canon workspace | Vendor-truth storage, overwrite-on-recrawl | Personal notes, curated cheatsheets |
| Shared bge-m3 | Embedding both passages and queries | Storage, retrieval |
| MinIO BlobStore | Raw HTML archive (audit trail of what was crawled) | Live serving |
| launchd | Daily timer, restart on crash | Crawl logic |

Workspace boundary — why this matters

The _canon workspace has one writer (the crawler) and one input source (the allowlist). Every other path is rejected or kept out of _canon:

| Attempted write path | Result |
|---|---|
| Manual kb_ingest MCP call with workspace='_canon' | Rejected at server: only crawler service account can write _canon |
| File written to KB mount under _canon/ folder | Rejected at mount-watcher: _canon ingest disabled for filesystem path |
| Crawler fetches a URL not in allowlist | Rejected at allowlist guard: raises IntegrityError, crawl halts |
| Personal note tagged vendor_doc | Stored in _personal workspace; tag ignored for routing |

This is the integrity contract. Once any leak path opens, the workspace’s value collapses within weeks (the original problem).
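
A server-side write check that enforces this contract might look like the sketch below. Everything named here is hypothetical: the principal name `svc-canon-crawler`, the `via` labels for the ingest path, and the `WriteRejected` exception are placeholders, since the real service-account mechanism is not described in this document.

```python
CANON_WORKSPACE = "_canon"
CRAWLER_PRINCIPAL = "svc-canon-crawler"  # hypothetical service account name
CRAWLER_PATH = "crawler"                 # hypothetical ingest path label


class WriteRejected(PermissionError):
    """Raised when a write to _canon does not come from the crawler."""


def authorize_write(workspace: str, principal: str, via: str) -> None:
    """Allow any write outside _canon; inside it, only the crawler path."""
    if workspace != CANON_WORKSPACE:
        return
    if via != CRAWLER_PATH or principal != CRAWLER_PRINCIPAL:
        raise WriteRejected(
            f"{principal!r} via {via!r} may not write to {CANON_WORKSPACE}"
        )


def resolve_workspace(requested: str, tags: frozenset) -> str:
    """Tags never change the target workspace: a note tagged vendor_doc
    stays wherever it was requested (for example _personal)."""
    return requested
```

Checking both the principal and the ingest path means a manual MCP call and a file dropped on the mount are rejected by the same function, so there is a single place to audit when the contract is reviewed.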

Failure modes & recovery

| Failure | Detect | Recovery | Time |
|---|---|---|---|
| Vendor URL 404 (page moved) | Crawler logs + count metric | Manual allowlist patch | minutes |
| Vendor changes HTML structure (parser breaks) | Empty content extracted, hash diff anomaly | Update BeautifulSoup selectors | <30 min |
| Crawl exceeds rate limit (HTTP 429) | httpx logs | Throttle backoff (already implemented at 1 req/s) | next run |
| MinIO mount unavailable | BlobStore write fails | Crawler retries next tick; Postgres upsert still proceeds | next run |
| Postgres down | All ingests fail | Restart tako daemon | <1 min |
| _canon data corruption | Audit script diffs hash on rotation | Re-crawl from allowlist (idempotent) | ~6 min |
| Allowlist file deleted | Crawler fails-fast | Restore from git | <1 min |
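
For the rate-limit row, a 1 req/s throttle can be sketched as below. This is not the crawler's actual implementation: the `Throttle` class and its injectable `clock` and `sleep` parameters are assumptions, chosen so the spacing logic can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Spaces successive calls at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,  # 1 req/s, as stated in the table
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each GET keeps requests to a single host at or below the configured rate; a response of 429 would still need separate handling, for example deferring that URL to the next daily run as the table describes.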

Why these choices

| Decision | Alternative considered | Why this won |
|---|---|---|
| Separate workspace, not a tag | source_type=vendor_doc + re-rank boost | Routing decision happens before retrieval, which is stronger than a post-hoc boost. The classifier-tag flow missed ~15% and was fragile. |
| Crawler-only writer | Allow manual _canon ingest with vendor URL | One leaky source contaminates the workspace within weeks; the constraint is the product. |
| Allowlist over auto-discovery | Crawl any URL the user mentions | Auto-discovery has no integrity boundary; the allowlist is the canon’s definition. |
| Daily cadence | Realtime crawl on conversation | 12 ms routing overhead vs minutes per query; daily is enough for vendor docs (they change weekly at most). |
| Routing in orchestrator, not prompt | Tell Claude “prefer canon when spec question” | Prompts drift; routing rules don’t. Server-side enforcement is consistent across all clients. |
| Shared bge-m3 | Dedicated canon embedder | The same query embedder must be used for both workspaces so that scores are comparable in fallback. |
| Drop Mode B | Build continuous evidence gathering | 1 data point of demand; overlap with weekly kb-audit. Re-evaluate after Mode A/C usage. |

See also

  • PRD for scope-cut reasoning on Mode B
  • Implementation for allowlist code + classifier wiring