Architecture
Sister docs: PRD (intent), Implementation (deep-dive), Notes (decision log).
System view
flowchart TB
classDef client fill:#cce0e8,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
classDef mode fill:#e0d5ed,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
classDef server fill:#faedd6,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
classDef store fill:#f4d6db,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
subgraph Clients["👤 Claude clients"]
Desktop["Claude Desktop / Web / iOS"]
end
Desktop --> Orch
subgraph Orch["🧭 MCP Orchestrator (tako server)"]
Router["Routing rule
spec/price/version → canon first"]
ToolCanon["kb_search_canon"]
ToolPers["kb_search_personal"]
ToolShared["kb_search_shared"]
Router --> ToolCanon
Router --> ToolPers
Router --> ToolShared
end
subgraph ModeA["📘 Mode A — fukuro-audit Skill"]
Audit["Claude Skill
ideation + production branches"]
end
subgraph ModeC["🤖 Mode C — Crawler daemon"]
Timer["launchd daily timer"]
Fetch["httpx + BeautifulSoup"]
Allow["Allowlist guard
(only writer to _canon)"]
Embed["bge-m3 (shared)"]
Timer --> Fetch
Fetch --> Allow
Allow --> Embed
end
Audit -.invokes.-> ToolCanon
subgraph DB["🗄️ Postgres 16 + pgvector"]
Canon["_canon workspace
~180 docs · 3,400 chunks · 28MB"]
Personal["_personal · _shared · ..."]
end
subgraph S3["📦 MinIO S3 (BlobStore mount)"]
RawHTML["Raw HTML archive
s3://canon/<host>/<path>.html"]
end
Embed --> Canon
Fetch --> RawHTML
ToolCanon --> Canon
ToolPers --> Personal
ToolShared --> Personal
class Desktop client
class Audit,Timer,Fetch,Allow,Embed mode
class Router,ToolCanon,ToolPers,ToolShared server
class Canon,Personal,RawHTML store
Two-mode design (post scope-cut)
| Mode | What it is | When it runs | Trigger |
|---|---|---|---|
| Mode A | fukuro-audit Claude Skill — audits ideation or production projects against canon facts | On user invocation | User types skill name in Claude |
| Mode B ❌ | Dropped: 1 data point of demand was insufficient, and it overlaps with the weekly kb-audit job | n/a | n/a |
| Mode C | Crawler daemon — only writer to _canon | Daily, ~6 min wall | launchd timer |
Mode A reads _canon via kb_search_canon; never writes. Mode C writes _canon exclusively; never reads at query time.
This separation is the integrity contract: the workspace cannot be corrupted by manual additions.
Data flow — Crawl (Mode C)
[Daily launchd timer fires]
│
▼
[Crawler loads allowlist.yml]
- docs.anthropic.com/*
- docs.x.ai/*
- huggingface.co/<org>/<model>
- arXiv PDFs cited in _personal (one-shot)
│
▼
[For each allowlist entry — async httpx]
GET <url> with custom User-Agent + If-Modified-Since header
│
├─ 304 Not Modified → skip (delta-aware)
├─ 404 Not Found → log + flag for human review
└─ 200 OK → proceed
│
▼
[Allowlist guard (final check before write)]
assert url matches allowlist pattern
assert host in {docs.anthropic.com, docs.x.ai, huggingface.co, arxiv.org}
else: raise IntegrityError (kills the crawl, alerts log)
│
▼
[Parse — BeautifulSoup]
extract main content, strip nav/footer/ads
derive title, canonical_url, last_modified
│
▼
[Archive raw HTML]
PUT s3://canon/<host>/<path>.html (MinIO mount)
│
▼
[Hash check (shared tako logic)]
SHA256(content) → compare kb_sources.content_hash
├─ same → skip embed, mark crawled_at
└─ differ → DELETE old chunks, re-embed
│
▼
[Embed via shared bge-m3]
chunk_text() → list[str]
encode(["passage: " + c]) → vectors
│
▼
[Upsert into Postgres _canon workspace]
INSERT/UPDATE kb_sources (workspace='_canon', source_type='vendor_doc')
INSERT kb_chunks bulk
COMMIT
│
▼
[Log to ~/.claude/hooks/fukuro-crawl.log]
{crawled: N, skipped_304: M, updated: K, errors: [...]}
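The delta-aware decisions in the crawl flow above (HTTP status dispatch, then content-hash comparison) can be sketched as a single pure function. This is a minimal sketch rather than the crawler's actual code: `plan_action` and its action names are hypothetical, and `stored_hash` stands in for the value read from `kb_sources.content_hash`.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the extracted page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_action(status: int, body: Optional[str], stored_hash: Optional[str]) -> str:
    """Decide what to do with one fetched allowlist entry.

    Action names are illustrative only:
      skip_304 - server reports the page is unchanged
      flag_404 - page moved; log and flag for human review
      touch    - content identical; only update crawled_at
      reembed  - content is new or changed; delete old chunks and re-embed
      error    - any other status, or a 200 without a body
    """
    if status == 304:
        return "skip_304"
    if status == 404:
        return "flag_404"
    if status != 200 or body is None:
        return "error"
    if stored_hash is not None and content_hash(body) == stored_hash:
        return "touch"
    return "reembed"
```

Keeping the decision free of I/O means the skip and re-embed paths can be unit-tested without a network or a database.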
Data flow — Query (routing)
[1] User → Claude:
"What's the current Haiku 4.5 input price?"
│
▼
[2] MCP orchestrator routing rule (server-side playbook):
intent_classifier(query) → matches {spec, price, version, context_window,
API_parameter, model_card}
│
▼
[3] Route decision:
if intent ∈ canon_intents → call kb_search_canon FIRST
else → call kb_search_personal / kb_search_shared
│
▼
[4] kb_search_canon executes:
qvec = bge-m3.encode(["query: " + user_query])[0]
SELECT ... FROM kb_chunks JOIN kb_sources
WHERE workspace = '_canon'
ORDER BY vector_distance(...) LIMIT top_k
│
▼
[5] Fallback rule:
if results empty OR top score < 0.65:
ALSO call kb_search_personal as supplement
│
▼
[6] Return chunks → Claude synthesizes with vendor URL citation
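Steps [3] to [5] above can be sketched as follows. This is a sketch under stated assumptions: the intent label is taken as an input because the classifier itself is not described here, `Hit` and `route` are hypothetical names, a single `search_other` callable stands in for `kb_search_personal` / `kb_search_shared`, and scores are assumed to be similarities where higher is better.

```python
from dataclasses import dataclass
from typing import Callable, List

# Intent labels and threshold are taken from the routing flow above.
CANON_INTENTS = {
    "spec", "price", "version", "context_window", "API_parameter", "model_card",
}
FALLBACK_THRESHOLD = 0.65


@dataclass
class Hit:
    text: str
    score: float   # assumed: higher means more similar
    source: str    # which workspace produced the chunk


SearchFn = Callable[[str], List[Hit]]


def route(query: str, intent: str, search_canon: SearchFn, search_other: SearchFn) -> List[Hit]:
    """Canon-first routing with the empty-or-weak fallback rule."""
    if intent not in CANON_INTENTS:
        return search_other(query)

    hits = search_canon(query)
    top = max((h.score for h in hits), default=None)
    if top is None or top < FALLBACK_THRESHOLD:
        # Supplement, do not replace: canon hits stay first in the result.
        hits = hits + search_other(query)
    return hits
```

Passing the search functions in as callables keeps the rule independent of the MCP tool wiring, so the threshold behaviour can be checked with stubs.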
Data flow — Audit (Mode A)
[User invokes fukuro-audit skill in Claude]
│
▼
[Skill parses args]
"fukuro audit ideation: <idea>" → JTBD/prior-art/scope/ROI rubric
"fukuro audit production: <slug>" → 6-category code scan + judge
│
▼
[For ideation branch]
- Call kb_search_canon for vendor facts referenced in idea
- Cross-check claims (e.g. "uses Claude 4.7" → verify model exists)
- Output rubric with citations
[For production branch]
- Scan project repo for AI infra files (config, prompts, model IDs)
- Cross-check against _canon (model deprecation, price drift, API changes)
- Grok 4.3 judges severity → P0/P1/P2 findings
│
▼
[ADHD-friendly digest returned to user]
- Health score (0-100)
- Collapsible P0/P1/P2 sections
- Each finding cites _canon source URL
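The argument-parsing step at the top of the audit flow could look like the sketch below. It assumes the invocation arrives as one plain string in the two forms shown above; `AuditRequest` and `parse_invocation` are hypothetical names, not part of the skill.

```python
import re
from typing import NamedTuple

# Matches "fukuro audit ideation: <idea>" or "fukuro audit production: <slug>".
# The target must start with a non-space character, so an empty target is rejected.
_INVOCATION = re.compile(
    r"^\s*fukuro audit (ideation|production):\s*(\S.*?)\s*$",
    re.IGNORECASE,
)


class AuditRequest(NamedTuple):
    branch: str   # "ideation" or "production"
    target: str   # the idea text or the project slug


def parse_invocation(text: str) -> AuditRequest:
    """Split a skill invocation into its branch and target."""
    match = _INVOCATION.match(text)
    if match is None:
        raise ValueError(
            "expected 'fukuro audit ideation: <idea>' "
            "or 'fukuro audit production: <slug>'"
        )
    return AuditRequest(branch=match.group(1).lower(), target=match.group(2))
```

Failing loudly on an unrecognised form avoids silently running the wrong rubric.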
Component responsibilities
| Component | Owns | Doesn’t own |
|---|---|---|
| Crawler daemon (Mode C) | Allowlist enforcement, fetching, parsing, archiving, upserting _canon | Routing, query-time logic |
| Allowlist guard | Final integrity check before any _canon write | Source discovery |
| fukuro-audit skill (Mode A) | Audit rubric, severity classification, digest formatting | Crawling, writing to _canon |
| MCP orchestrator | Routing decision (canon-first for spec/price intents), fallback rule | Crawling, source curation |
| _canon workspace | Vendor-truth storage, overwrite-on-recrawl | Personal notes, curated cheatsheets |
| Shared bge-m3 | Embedding both passages and queries | Storage, retrieval |
| MinIO BlobStore | Raw HTML archive (audit trail of what was crawled) | Live serving |
| launchd | Daily timer, restart on crash | Crawl logic |
Workspace boundary — why this matters
The _canon workspace has one writer (the crawler) and one input source (the allowlist). Every other path is rejected:
| Attempted write path | Result |
|---|---|
| Manual kb_ingest MCP call with workspace='_canon' | Rejected at server: only crawler service account can write _canon |
| File written to KB mount under _canon/ folder | Rejected at mount-watcher: _canon ingest disabled for filesystem path |
| Crawler fetches a URL not in allowlist | Rejected at allowlist guard: raises IntegrityError, crawl halts |
| Personal note tagged vendor_doc | Stored in _personal workspace; tag ignored for routing |
This is the integrity contract. Once any leak path opens, the workspace’s value collapses within weeks (the original problem).
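Two of the rejections in the table (the allowlist guard and the server-side writer check) can be sketched as below. This is an illustration under assumptions, not the implementation: the patterns are hypothetical stand-ins for the contents of allowlist.yml, `CANON_WRITER` is a made-up service-account name, and `PermissionError` is an arbitrary choice for the server-side rejection. Only the host set and the `IntegrityError` name come from the crawl flow.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"docs.anthropic.com", "docs.x.ai", "huggingface.co", "arxiv.org"}

# Hypothetical patterns; the real ones live in allowlist.yml.
ALLOWLIST_PATTERNS = [
    "docs.anthropic.com/*",
    "docs.x.ai/*",
    "huggingface.co/*/*",
    "arxiv.org/pdf/*",
]

CANON_WRITER = "crawler"  # hypothetical service-account name


class IntegrityError(Exception):
    """Raised when a URL outside the allowlist is about to be written."""


def guard_url(url: str) -> str:
    """Final check before any _canon write: host AND pattern must both match."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    target = host + parts.path
    if host not in ALLOWED_HOSTS:
        raise IntegrityError(f"host not allowed: {host!r}")
    if not any(fnmatchcase(target, pattern) for pattern in ALLOWLIST_PATTERNS):
        raise IntegrityError(f"url matches no allowlist pattern: {url!r}")
    return url


def authorize_write(workspace: str, principal: str) -> None:
    """Server-side rule: only the crawler may write to _canon."""
    if workspace == "_canon" and principal != CANON_WRITER:
        raise PermissionError(f"{principal!r} may not write to _canon")
```

Checking the exact hostname before the pattern means a lookalike such as `docs.anthropic.com.evil.example` fails even if a loosely written pattern would have matched it.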
Failure modes & recovery
| Failure | Detect | Recovery | Time |
|---|---|---|---|
| Vendor URL 404 (page moved) | Crawler logs + count metric | Manual allowlist patch | minutes |
| Vendor changes HTML structure (parser breaks) | Empty content extracted, hash diff anomaly | Update BeautifulSoup selectors | <30 min |
| Crawl exceeds rate limit (HTTP 429) | httpx logs | Throttle backoff (already implemented at 1 req/s) | next run |
| MinIO mount unavailable | BlobStore write fails | Crawler retries next tick; Postgres upsert still proceeds | next run |
| Postgres down | All ingests fail | Restart tako daemon | <1 min |
| _canon data corruption | Audit script diffs hash on rotation | Re-crawl from allowlist (idempotent) | ~6 min |
| Allowlist file deleted | Crawler fails-fast | Restore from git | <1 min |
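The 1 req/s throttle mentioned in the HTTP 429 row could be implemented along these lines. This is a sketch only: the `Throttle` class is hypothetical, and the clock and sleep functions are injectable purely so the spacing logic can be tested without real waiting.

```python
import asyncio
import time
from typing import Awaitable, Callable


class Throttle:
    """Spaces request starts at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,  # 1 req/s, as in the table above
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    async def wait(self) -> float:
        """Reserve the next free slot, sleep until it starts, return the delay."""
        now = self._clock()
        start = max(now, self._next_allowed)
        # Reserve the slot before awaiting, so concurrent coroutines on the
        # same event loop each get a distinct slot.
        self._next_allowed = start + self._min_interval
        delay = start - now
        if delay > 0:
            await self._sleep(delay)
        return delay
```

A crawler would call `await throttle.wait()` immediately before each GET. Honouring a server-supplied `Retry-After` header on 429 would be a separate step that this sketch does not cover.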
Why these choices
| Decision | Alternative considered | Why this won |
|---|---|---|
| Separate workspace, not a tag | source_type=vendor_doc + re-rank boost | The routing decision happens before retrieval, which is stronger than a post-hoc boost. The classifier-tag flow missed ~15% and was fragile. |
| Crawler-only writer | Allow manual _canon ingest with vendor URL | One leaky source contaminates the workspace within weeks; the constraint is the product. |
| Allowlist over auto-discovery | Crawl any URL the user mentions | Auto-discovery has no integrity boundary; the allowlist is the canon’s definition. |
| Daily cadence | Realtime crawl on conversation | Routing to a pre-crawled canon adds about 12 ms per query, whereas a realtime crawl would add minutes. Daily is enough for vendor docs (they change weekly at most). |
| Routing in orchestrator, not prompt | Tell Claude “prefer canon when spec question” | Prompts drift; routing rules don’t. Server-side enforcement = consistent across all clients. |
| Shared bge-m3 | Dedicated canon embedder | Same query embedder must be used for both workspaces to make scores comparable in fallback. |
| Drop Mode B | Build continuous evidence gathering | 1 data point of demand; overlap with weekly kb-audit. Re-evaluate after Mode A/C usage. |
See also
- PRD for scope-cut reasoning on Mode B
- Implementation for allowlist code + classifier wiring