Open notebook
What I'm learning about LLMs, prompt engineering, RAG, MCP, and the rest. Half memo to my future self, half "maybe useful to someone else." All in the open.
Harness Engineering: The Scaffolding Is the Product, Not the Model
I ran two AI coding agents — Claude Code and Antigravity — through 100 turns of pair-programming over a Redis Streams channel I built, while they wrote real code against a live Redis. The code wasn't the lesson. What kept them working together was the harness. Field notes on engineering the scaffolding around an LLM.
AI-Canon-Crawler: why I gave my RAG a separate workspace for vendor truth
Personal-RAG kept hallucinating vendor specs because Marc's notes and Anthropic's docs lived in the same index. The fix: a dedicated canon workspace, a crawler that only ingests authoritative sources, and a routing rule that prefers canon when the question is about pricing, model size, or API behavior.
Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3
A real production use case: build a cross-source contradiction detector on Personal-RAG. Verify ground truth with Eval-Framework, swap Haiku → Grok 4.3, real accuracy 80% → 99%.
Mac-Translator: a Force Click translator that respects the flow
Building a menubar translator triggered by Force Click anywhere on macOS. Why the input gesture matters more than the model, and the AX/permission rebuild trap that cost me two evenings.
LLM bake-off across 6 judges: methodology + 7 surprising insights
Tested Claude (Haiku/Sonnet/Opus) vs Grok 4.3 vs GPT-5.4-mini vs Gemini Flash. N=30 dev + N=99 holdout. Stratified eval. Pairwise Cohen's kappa. Surprising winner.
7 decisions an AI assistant almost got wrong on an Eval-Framework SMB project
Field notes through a PM lens: building Eval-Framework for an SMB use case (~15k audits/mo). The AI engineering pair proposed 7 decisions that could have burned budget, breached compliance, and shipped the wrong roadmap. What the PM caught and how each was reframed. Counter-prompt playbook below.
Eval-Framework: why PMs need to measure before picking an LLM
Building a personal eval framework: from a 20-case smoke test → 192 stratified cases → a bake-off across 6 models. When 'guessing' which LLM is best gets expensive and wrong.
Diagram-Engine: text-to-diagram designed for AI agents, not humans
Why Mermaid/PlantUML break down when an LLM is the author. JTBD, syntax design constraints, and 37 layout rules that made the output good enough for a public playground.
Why your AI memory lies to you — and how I built a 3-layer Knowledge-Audit to catch it
Claude Code, Cursor, ChatGPT all ship 'persistent memory'. Six months in, the AI confidently asserts facts that are no longer true. I built a 3-layer audit plus a 4-tier cron to verify every claim.
Mail-Assistant: how I cut inbox triage from 90 min to 8 min with an ADHD-first AI agent
Field note: building a personal email triage agent that surfaces only what actually needs action. 568 messages → 7 P0/P1 rows. The 5 product decisions that killed 87% of the noise.
An AI sleep coach over Telegram: PM journey + 5 design principles for ADHD
I've worn a Garmin for 2 years: the sleep, HRV, and RHR data are all there, but none of it turned into action. This post logs my build through a PM lens: JTBD, existing alternatives, MVP scope decisions, ADHD-first design principles, and a V2 roadmap for a personalized AI sleep coach over Telegram.
Personal RAG KB: A 4-Version Journey from Cloud ADB to Local Mac M2 Max
Why I moved from Oracle ADB cloud → pgvector replica → local Mac — and what I learned measuring latency, quality, and cost at each version.
Building a Mac Voice Assistant and the Painful Lesson About Local LLMs
I built an AI voice assistant for daily use to truly understand the local vs cloud LLM trade-off. The surprise lesson: for agentic products, cloud at $0.001/turn is cheaper than 'free' local once you count true TCO. A decision framework for product teams and CTOs.
5 Innate Biases of Cloud LLMs & 5 Levers to Overcome Them
Why Claude / GPT / Gemini get it wrong so often when writing workflow docs, PRDs, BRDs — and why you can't 'fix' the weights. You can only build guardrails around them.
The 15-Minute Deep Dive Summary: The Ultimate Prompt for NotebookLM
Turn any document into a deep, vivid 15-minute podcast-style Audio Overview, entirely in Vietnamese, with this prompt.
Second Brain & LLM OS Model - Andrej Karpathy
Interactive page visualizing Andrej Karpathy's philosophy for building an LLM Operating System and RAG architecture.
Prompt Engineering 101 — The Art of Talking to AI
A roundup of fundamental and advanced Prompt Engineering techniques for interacting effectively with LLMs.
Welcome to Vuihoc.AI
First post — Introducing Vuihoc.AI and why I started journaling my AI research journey.