Personal RAG Knowledge Base — PRD

Size S · P0 · Foundation
Status: ✅ M1 done (2026-04-29), S3 migration done (2026-05-24); see Implementation for build details.
Originally planned: 1 weekend / Actual: ~2 days of concentrated work + ongoing iteration

1. Problem

Personal knowledge is fragmented across many sources: wiki/documentation pages, ticketing systems, chat threads, meeting transcripts, email, markdown notes, and AI conversation transcripts. When trying to recall “what did I read about X?”, finding the answer takes 15–30 minutes of manual hunting across separate tools. OS-level search (Spotlight) is keyword-only and doesn’t understand semantic intent. Per-source MCP search is slow and bloats the context window with raw chunks.

Pain: ~90% of consumed knowledge is not retrievable on demand.

Why now: this is a foundation pattern that 6+ downstream AI projects can reuse (recipe extractor, research agent, support bot, finance advisor, etc.). Build once, reuse many times. The corpus is also multi-workspace by nature — work knowledge, consulting knowledge, personal notes, and external vendor reference docs each have different trust tiers and need scoped access.

2. Goal & Success Metrics

Goal: Ask a Claude client “any thoughts on vector DB X for ~100K vectors?” → get top-5 relevant chunks + source links in under 3 seconds, scoped to the right workspace automatically.

Metrics (targets vs. achieved):

Metric	Target M1	Achieved	Note
Hit@3 on held-out eval	≥80%	97.8%	93-query personal-workspace eval (2026-05-21)
MRR on held-out eval	≥0.85	0.948	Same eval set
Latency p95 warm	<3 s	840 ms	Embed query + ANN search + rerank + tunnel
Latency p95 cold	<5 s	2.3 s	Includes model load
Sources ingested	100 docs	42,000+ sources / 182,000+ chunks	Across 6 workspaces
Workspaces	1	6	ll / mindx / _personal / _shared / _canon / _secrets
Touchpoint	Telegram	MCP: Claude Desktop + Claude.ai web + iOS app	Pivoted Day 1, validated Day 8
Resource footprint	n/a	388 MB idle / 501 MB active	Local MBP M2 Max
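Hit@3 and MRR in this table follow the usual retrieval definitions. As a minimal Python sketch, assume each eval query yields a ranked list of returned chunk ids plus a set of ids judged relevant (the eval harness itself is not described in this PRD, so all names here are illustrative). Hit@k is the share of queries with a relevant id in the top k; MRR averages 1/rank of the first relevant hit, counting 0 when none is returned, with no rank cutoff assumed.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[set], k: int) -> float:
    """Share of queries whose top-k results contain at least one relevant id."""
    hits = sum(1 for res, rel in zip(ranked, relevant) if any(doc in rel for doc in res[:k]))
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], relevant: Sequence[set]) -> float:
    """Mean of 1/rank of the first relevant result; a query with no relevant hit scores 0."""
    total = 0.0
    for res, rel in zip(ranked, relevant):
        for rank, doc in enumerate(res, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy eval: three queries; ids are illustrative chunk identifiers.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
relevant = [{"a"}, {"z"}, {"missing"}]

print(hit_at_k(ranked, relevant, 3))           # 2 of 3 queries hit within the top 3
print(hit_at_k(ranked, relevant, 1))           # only the first query hits at rank 1
print(mean_reciprocal_rank(ranked, relevant))  # (1 + 1/3 + 0) / 3
```

For scale, 97.8% Hit@3 on the 93-query set corresponds to 91 of 93 queries (91/93 ≈ 0.978).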

3. User journey (revised)

Pivot from original: dropped the custom Telegram bot in favor of MCP, so Claude clients call workspace-scoped tools (kb_search_ll, kb_search_personal, etc.) natively.

A sync agent (chat exporter, meeting transcriber, AI session hook, ticket crawler, vendor-doc crawler) writes a .md file into the local KB mount path.
A filesystem watcher (tako-mount-watcher, fswatch-driven) detects the write and calls kb-ingest-file.py → chunks + embeds + stores.
User asks Claude: “Anything new on topic X this week?”
Claude routes to the right scoped tool (e.g. kb_search_ll with client_filter='PCF' when the query mentions PCF) → returns top-K chunks with source URIs.
Claude synthesizes an answer with citations.
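The watcher step in this flow amounts to a small filter between fswatch and the ingest script. This is a minimal Python sketch under stated assumptions: the watcher reads `fswatch -0` output (NUL-separated paths), the mount root `/Users/me/kb` is a placeholder, and the `python3 kb-ingest-file.py <path>` calling convention is hypothetical. The real tako-mount-watcher is not shown in this PRD.

```python
from pathlib import PurePosixPath

# Placeholder mount root and hypothetical ingest command; not taken from the real watcher.
KB_MOUNT = PurePosixPath("/Users/me/kb")
INGEST_CMD = ["python3", "kb-ingest-file.py"]


def parse_fswatch0(stream: bytes) -> list:
    """Split `fswatch -0` output, where each changed path is terminated by a NUL byte."""
    return [chunk.decode("utf-8") for chunk in stream.split(b"\0") if chunk]


def ingest_commands(paths: list, mount: PurePosixPath = KB_MOUNT) -> list:
    """One ingest command per distinct visible .md file under the mount, in first-seen order."""
    seen = set()
    commands = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix != ".md" or path.name.startswith(".") or not path.is_relative_to(mount):
            continue  # skip non-markdown, hidden/editor temp files, and paths outside the KB
        if raw in seen:
            continue  # a single save can emit several events for the same file
        seen.add(raw)
        commands.append(INGEST_CMD + [raw])
    return commands


events = (
    b"/Users/me/kb/notes/a.md\0"
    b"/Users/me/kb/notes/a.md\0"
    b"/Users/me/kb/notes/.#a.md\0"
    b"/Users/me/kb/img.png\0"
    b"/tmp/other.md\0"
)
for cmd in ingest_commands(parse_fswatch0(events)):
    print(cmd)  # a real watcher would run subprocess.run(cmd, check=True) here
```

Deduplicating within a batch only saves work; the server-side SHA-256 check makes a duplicate ingest harmless anyway.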

4. Scope (MoSCoW) — final

Must — DONE:

✅ Ingest markdown files (URL/PDF sources are handled by the sync agents, which write .md files)
✅ Chunk + embed + store in Postgres 16 + pgvector with HNSW
✅ MCP tools per workspace: kb_health, kb_ingest, kb_search_*, kb_stats
✅ Citations: search results include source_uri and chunk_idx
✅ Multi-workspace isolation with scoped tools
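Scoped tools and citations can be illustrated together. This is a minimal in-memory Python sketch under assumptions: each chunk carries its workspace, source_uri, chunk_idx, and an optional client tag matched by client_filter; exact cosine similarity stands in for the pgvector HNSW (approximate) search; and a toy word-overlap scorer stands in for the bge-reranker-v2-m3 cross-encoder. `make_scoped_search` and the `client` field are hypothetical names, not the server's actual code. The property it shows is that the workspace is bound when the tool is built, so a caller can narrow results with client_filter but never widen them to another workspace.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Chunk:
    workspace: str
    source_uri: str
    chunk_idx: int
    text: str
    embedding: tuple
    client: Optional[str] = None  # hypothetical tag matched by client_filter


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_scoped_search(workspace: str, corpus: list, rerank: Callable[[str, str], float]):
    """Build a search tool whose workspace is fixed at creation, not chosen by the caller."""

    def search(query: str, query_vec, top_n: int = 20, top_k: int = 5,
               client_filter: Optional[str] = None) -> list:
        pool = [c for c in corpus
                if c.workspace == workspace and (client_filter is None or c.client == client_filter)]
        # Stage 1: nearest neighbours by embedding (exact here; HNSW would be approximate).
        candidates = sorted(pool, key=lambda c: cosine(query_vec, c.embedding), reverse=True)[:top_n]
        # Stage 2: rerank the shortlist with a (query, passage) relevance scorer.
        best = sorted(candidates, key=lambda c: rerank(query, c.text), reverse=True)[:top_k]
        # Citations: every hit carries its source URI and chunk index.
        return [{"source_uri": c.source_uri, "chunk_idx": c.chunk_idx, "text": c.text} for c in best]

    return search


def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder relevance score."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


corpus = [
    Chunk("ll", "file:///kb/ll/a.md", 0, "pgvector hnsw index tuning", (1.0, 0.0), "PCF"),
    Chunk("ll", "file:///kb/ll/b.md", 2, "vector db comparison notes", (0.9, 0.1), "ACME"),
    Chunk("_personal", "file:///kb/personal/c.md", 0, "pgvector hnsw at home", (1.0, 0.0)),
]
kb_search_ll = make_scoped_search("ll", corpus, word_overlap)
for hit in kb_search_ll("pgvector hnsw", (1.0, 0.0)):
    print(hit["source_uri"], hit["chunk_idx"])
print(kb_search_ll("pgvector hnsw", (1.0, 0.0), client_filter="PCF"))
```

The defaults top_n=20 and top_k=5 mirror the "top-20" rerank window and the top-5 goal stated elsewhere in this PRD.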

Should — DONE:

✅ Idempotent ingest via server-side SHA-256 hash check
✅ Re-ingest replaces chunks if content hash changed
✅ Auto-tag from folder hierarchy (path-based classifier)
✅ Reranker stage (bge-reranker-v2-m3) on top-N — marginal +3.2pp Hit@1, shipped as default
⏸️ BM25 + vector hybrid — semantic + reranker proved sufficient on the eval set
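The two ingest rules above (skip when the content hash is unchanged, replace all chunks when it changed) reduce to a small decision. This is a minimal Python sketch, assuming an in-memory dict in place of the Postgres tables and a naive fixed-size chunker; in the real system the replace step would presumably be a delete plus insert inside one transaction. `KBStore` and `chunk_text` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


def chunk_text(text: str, size: int = 200) -> list:
    """Naive fixed-size character chunker, standing in for the real one."""
    return [text[i:i + size] for i in range(0, len(text), size)]


@dataclass
class KBStore:
    # source_uri -> (sha256 hex of content, list of chunks); stands in for the DB tables
    sources: dict = field(default_factory=dict)

    def ingest(self, source_uri: str, content: str) -> str:
        # The hash is computed server-side from the received content, not trusted from the client.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        existing = self.sources.get(source_uri)
        if existing is not None and existing[0] == digest:
            return "skipped"  # identical content: no re-chunking or re-embedding
        # New or changed content: old chunks are replaced wholesale, never merged.
        self.sources[source_uri] = (digest, chunk_text(content))
        return "inserted" if existing is None else "replaced"


store = KBStore()
uri = "file:///kb/_personal/notes/vector-dbs.md"
print(store.ingest(uri, "notes on vector DBs"))           # inserted
print(store.ingest(uri, "notes on vector DBs"))           # skipped
print(store.ingest(uri, "notes on vector DBs, updated"))  # replaced
```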

Could — partial:

⏸️ Apple Notes import — out of scope, low ROI for actual usage
⏸️ Kindle highlights import — same reason
✅ Wiki/docs crawler integration: handled by the sync-agent scripts
✅ Vendor-doc crawler (Mode C of the audit/canon pipeline): populates the _canon workspace
❌ Custom web UI — replaced by Claude clients themselves

Won’t (M1–M3) — kept:

Multi-user support (single-user system by design)
Real-time sync — fswatch + hourly mirror is sufficient
Native mobile app: the Claude iOS app gets access via OAuth

5. Architecture (final)

Pivoted from “RAG + Telegram Bot” → MCP Server + Multi-workspace scoped retrieval + Filesystem-first ingest. See Architecture for diagrams.

6. Tech Stack — final choices

Layer	Original spec	Implemented	Reason for change
LLM serving	Local Llama 3.2 3B	External (caller’s Claude client)	MCP delegates LLM to caller; server only does retrieval
Embedder	BGE-small-en-v1.5	bge-m3 (multilingual)	English-only embedder underperformed on mixed VN/EN; bge-m3 handles VN+EN+code-mixed
Reranker	(deferred)	bge-reranker-v2-m3 (cross-encoder)	+3.2pp Hit@1 on eval set, negligible added latency at top-20
Vector DB	Oracle ADB 23ai	Postgres 16 + pgvector (HNSW)	Local deploy, no cloud dependency in the hot path; comfortably fits the 3.2 GB dataset
Object storage	(none)	MinIO S3 primary + filesystem mirror fallback	BlobStore abstraction; dual-scheme URIs (`s3://` + `file://`)
Bot framework	python-telegram-bot	MCP Streamable HTTP	Native Claude integration, multi-client for free
HTTP framework	—	FastMCP + Starlette + uvicorn	MCP SDK provides this out-of-box
Tunnel	Public IP + nginx	Cloudflare named tunnel	Persistent URL, no inbound port open; optional, used only for remote-device access
Auth	“Telegram only”	OAuth 2.0 (PKCE+DCR) + legacy bearer	Supports mobile/web/desktop clients
Runtime host	Cloud VM	Local MacBook Pro M2 Max	Low latency, no cloud bill, hardware already paid for; daemon runs under launchd
Auto-ingest	systemd cron	launchd `tako-mount-watcher` (fswatch)	Real-time, no polling
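The PKCE part of the Auth row is small enough to show. This is a minimal Python sketch of the S256 method from RFC 7636: the client derives code_challenge = BASE64URL(SHA256(code_verifier)) without padding and sends it with the authorization request; at the token exchange the server recomputes it from the presented code_verifier and compares. The function names are hypothetical, and how this server stores the challenge between the two requests is not specified in this PRD.

```python
import base64
import hashlib
import hmac
import secrets
import string

# RFC 7636 "unreserved" characters allowed in a code_verifier.
ALLOWED = frozenset(string.ascii_letters + string.digits + "-._~")


def new_code_verifier() -> str:
    """Client side: 32 random bytes encode to a 43-char base64url string (43-128 allowed)."""
    return secrets.token_urlsafe(32)


def s256_challenge(code_verifier: str) -> str:
    """BASE64URL(SHA256(ASCII(code_verifier))) with the '=' padding removed."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(code_verifier: str, stored_challenge: str) -> bool:
    """Server side, at the token endpoint: validate the verifier, recompute, compare in constant time."""
    if not (43 <= len(code_verifier) <= 128 and set(code_verifier) <= ALLOWED):
        return False
    return hmac.compare_digest(s256_challenge(code_verifier), stored_challenge)


verifier = new_code_verifier()
challenge = s256_challenge(verifier)                # sent with the authorization request
print(verify_pkce(verifier, challenge))             # True: same verifier at token exchange
print(verify_pkce(new_code_verifier(), challenge))  # False: a different verifier is rejected
```

Because only the hash travels in the authorization request, an intercepted authorization code is useless without the verifier, which is why PKCE suits public clients like the iOS and web apps.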

Cost posture: $0/month for compute (own hardware) + $0 for Cloudflare tunnel. No managed services in the hot path. The architecture intentionally avoids vendor lock-in — every component is OSS-replaceable.

7. Milestones — actual

Day	What shipped
Day 1	MCP server scaffold (FastMCP), tunnel, bearer auth
Day 2	`kb_ingest` / `kb_search` / `kb_stats` tools, embed bench, schema applied
Day 3	Bulk migration of the first 5,000+ sources (~95 min wall-clock)
Day 3b	Embed model swap to multilingual model (VN/EN parity)
Day 4-5	Refactor sync sources to call MCP `kb_ingest` (filesystem-first rule)
Day 6	Persistent named tunnel via custom domain
Day 7	Weekly backup + restore script
Day 8	OAuth 2.0 + DCR + PKCE → Claude.ai web + iOS app access
2026-05-x	Multi-workspace refactor: 6 workspaces + scoped MCP tools
2026-05-21	Reranker shipped (bge-reranker-v2-m3); Hit@3 = 97.8% on held-out eval
2026-05-24	S3 migration: BlobStore abstraction (MinIO primary + FS mirror); v0.6.0-s3
2026-05-25	`_canon` workspace + `kb_search_canon` tool for vendor authoritative docs

M1 DoD passed:

✅ Ingest 5,000+ sources (target 10) — now 42K+
✅ Real queries answered correctly — Hit@3 = 97.8% on held-out eval
✅ Latency 840 ms p95 warm (target <3 s)

8. Cost & Quota

Item	Free?	Actual usage
Postgres 16 + pgvector (local)	✅	3.2 GB on disk
MinIO S3 (local)	✅	~50 GB blob storage (primary)
Cloudflare named tunnel	✅	<1 MB/day data
Compute (MBP M2 Max)	✅ (owned hw)	388 MB idle / 501 MB active
Daily Postgres backup (`tako-pg-backup`)	✅	local
Hourly FS mirror (`tako-fs-backup`)	✅	local

No cloud serving cost. The optional Cloudflare tunnel is only used when accessing from a non-host device (phone, other laptop).

9. Risks & open questions — outcomes

Original risks:

Cloud DB free-tier eviction → eliminated by going local Postgres
Embedding quality on Vietnamese text → resolved with bge-m3 (multilingual SOTA at 568M params)
Local LLM crash recovery → N/A (LLM not used in final architecture)

Current risks:

Single-host SPOF — mitigated by hourly FS mirror + daily Postgres dump; restore tested
Cloudflare bot management blocking default UAs → fixed with custom UA header
Token rotation relies on operator memory → move tokens into a password manager

Original open Qs:

Q1: Web UI from M2? → ❌ dropped, Claude clients are sufficient
Q2: Privacy of sensitive content? → ✅ accepted, single-tenant local deployment; _secrets workspace stays out of the MCP surface
Q3: Reranker latency on CPU? → resolved, shipped 2026-05-21

10. Definition of Done

M1 Done: ✅ 2026-04-29 — initial corpus ingested, OAuth flow live, multilingual search working, DR backup in place.

S3 milestone done: ✅ 2026-05-24 — BlobStore abstraction shipped, 1000-row regression at 100% parity, hourly + daily backups via launchd.
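The BlobStore abstraction named here (and in the Tech Stack table: MinIO S3 primary, filesystem mirror fallback, dual `s3://` + `file://` URIs) can be sketched as two interchangeable stores behind one interface, plus a parity check over sampled keys. This is a minimal Python sketch under assumptions: an in-memory dict stands in for the MinIO bucket (a real version would wrap an S3 client), object keys map directly to relative paths in the mirror, and all class names are hypothetical. What the actual 1000-row regression compared is not specified in this PRD; byte-for-byte equality of primary and mirror reads is one plausible check.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...


class MemoryS3Store:
    """Stand-in for the MinIO bucket; a real version would call an S3 client."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.objects = {}

    def put(self, key: str, data: bytes) -> str:
        self.objects[key] = data
        return f"s3://{self.bucket}/{key}"

    def get(self, key: str) -> bytes:
        return self.objects[key]  # KeyError plays the role of "object not found"


class FileStore:
    """Filesystem mirror: object keys become relative paths under a root directory."""

    def __init__(self, root: Path):
        self.root = root

    def put(self, key: str, data: bytes) -> str:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return path.resolve().as_uri()

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class MirroredStore:
    """Write to both stores; read from the primary and fall back to the mirror."""

    def __init__(self, primary: BlobStore, mirror: BlobStore):
        self.primary, self.mirror = primary, mirror

    def put(self, key: str, data: bytes) -> tuple:
        return self.primary.put(key, data), self.mirror.put(key, data)

    def get(self, key: str) -> bytes:
        try:
            return self.primary.get(key)
        except (KeyError, OSError):  # a real S3 client raises its own error types
            return self.mirror.get(key)


def parity_mismatches(keys, a: BlobStore, b: BlobStore) -> list:
    """Keys whose bytes differ between two stores; an empty list means 100% parity."""
    return [k for k in keys if a.get(k) != b.get(k)]


store = MirroredStore(MemoryS3Store("kb"), FileStore(Path(tempfile.mkdtemp())))
s3_uri, file_uri = store.put("ll/a.md", b"# note A")
store.put("ll/b.md", b"# note B")
print(s3_uri, file_uri)
print(parity_mismatches(["ll/a.md", "ll/b.md"], store.primary, store.mirror))

del store.primary.objects["ll/a.md"]  # simulate a miss on the primary
print(store.get("ll/a.md"))           # served from the filesystem mirror
```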

M3 Done (production-ready):

⏳ TOTP 2FA or SSO wrap on /login
✅ Reranker eval on top-N candidates → measured +3.2pp Hit@1
✅ Daily-driver criterion: in continuous personal use across 4+ weeks without an architecture pivot
