Research notes

Open notebook

What I'm learning about LLMs, prompt engineering, RAG, MCP, and the rest. Half memo to my future self, half "maybe useful to someone else." All in the open.

📚 18 posts
May 28, 2026

Harness Engineering: The Scaffolding Is the Product, Not the Model

I ran two AI coding agents — Claude Code and Antigravity — through 100 turns of pair-programming over a Redis Streams channel I built, while they wrote real code against a live Redis. The code wasn't the lesson. What kept them working together was the harness. Field notes on engineering the scaffolding around an LLM.

Read post
May 24, 2026

AI-Canon-Crawler: why I gave my RAG a separate workspace for vendor truth

Personal-RAG kept hallucinating vendor specs because Marc's notes and Anthropic's docs lived in the same index. The fix: a dedicated canon workspace, a crawler that only ingests authoritative sources, and a routing rule that prefers canon when the question is about pricing, model size, or API behavior.

Read post
Apr 23, 2026

Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3

A real production use case: building a cross-source contradiction detector on Personal-RAG. Verify ground truth with Eval-Framework, swap Haiku for Grok 4.3, and measured accuracy goes from 80% to 99%.

Read post
Apr 10, 2026

Mac-Translator: a Force Click translator that respects the flow

Building a menubar translator triggered by Force Click anywhere on macOS. Why the input gesture matters more than the model, and the AX/permission rebuild trap that cost me two evenings.

Read post
Mar 22, 2026

LLM bake-off across 6 judges: methodology + 7 surprising insights

Tested Claude (Haiku/Sonnet/Opus) vs Grok 4.3 vs GPT-5.4-mini vs Gemini Flash. N=30 dev + N=99 holdout. Stratified eval. Pairwise Cohen's kappa. Surprising winner.

Read post
Feb 24, 2026

7 decisions an AI assistant almost got wrong on an Eval-Framework SMB project

Field notes through a PM lens: building Eval-Framework for an SMB use case (~15k audits/mo). The AI engineering pair proposed 7 decisions that could have burned budget, broken compliance, and shipped the wrong roadmap. Covers the PM catches, the reframes, and a counter-prompt playbook.

Read post
Jan 21, 2026

Eval-Framework: why PMs need to measure before picking an LLM

Building a personal eval framework: from a 20-case smoke test → 192 stratified cases → a bake-off across 6 models. When 'guessing' which LLM is best gets expensive and wrong.

Read post
Dec 8, 2025

Diagram-Engine: text-to-diagram designed for AI agents, not humans

Why Mermaid/PlantUML break down when an LLM is the author. JTBD, syntax design constraints, and 37 layout rules that made the output good enough for a public playground.

Read post
Nov 8, 2025

Why your AI memory lies to you — and how I built a 3-layer Knowledge-Audit to catch it

Claude Code, Cursor, and ChatGPT all ship 'persistent memory'. Six months in, the AI confidently asserts facts that are no longer true. I built a 3-layer audit plus a 4-tier cron to verify every claim.

Read post
Sep 15, 2025

Mail-Assistant: how I cut inbox triage from 90 min to 8 min with an ADHD-first AI agent

Field note: building a personal email triage agent that surfaces only what actually needs action. 568 messages → 7 P0/P1 rows. The 5 product decisions that killed 87% of the noise.

Read post
Aug 9, 2025

An AI sleep coach over Telegram: PM journey + 5 design principles for ADHD

I've worn a Garmin for 2 years — sleep, HRV, RHR data all there but zero action. This post logs my build through a PM lens: JTBD, existing alternatives, MVP scope decisions, ADHD-first design principles, and a V2 roadmap for a personalized AI sleep coach over Telegram.

Read post
May 22, 2025

Personal RAG KB: A 4-Version Journey from Cloud ADB to Local Mac M2 Max

Why I moved from Oracle ADB cloud → pgvector replica → local Mac — and what I learned measuring latency, quality, and cost at each version.

Read post
Feb 18, 2025

Building a Mac Voice Assistant and the Painful Lesson About Local LLMs

I built an AI voice assistant for daily use to truly understand the local vs cloud LLM trade-off. The surprise lesson: for agentic products, cloud at $0.001/turn is cheaper than 'free' local once you count true TCO. A decision framework for product teams and CTOs.

Read post
Dec 5, 2024

5 Innate Biases of Cloud LLMs & 5 Levers to Overcome Them

Why Claude / GPT / Gemini get it wrong so often when writing workflow docs, PRDs, BRDs — and why you can't 'fix' the weights. You can only build guardrails around them.

Read post
Oct 10, 2024

The 15-Minute Deep Dive Summary: The Ultimate Prompt for NotebookLM

Turn any document into a deep, vivid 15-minute podcast-style Audio Overview, entirely in Vietnamese, with this prompt.

Read post
Aug 20, 2024

Second Brain & LLM OS Model - Andrej Karpathy

Interactive page visualizing Andrej Karpathy's philosophy for building an LLM Operating System and RAG architecture.

Read post
Jun 15, 2024

Prompt Engineering 101 — The Art of Talking to AI

A roundup of fundamental and advanced Prompt Engineering techniques for interacting effectively with LLMs.

Read post
Apr 1, 2024

Welcome to Vuihoc.AI

First post — Introducing Vuihoc.AI and why I started journaling my AI research journey.

Read post