Research notes

Open notebook

What I'm learning about LLMs, prompt engineering, RAG, MCP, and the rest. Half memo to my future self, half "maybe useful to someone else." All in the open.

📚 27 posts

Distilling a RAG Knowledge Base: When Better Answers Come From the Corpus, Not the Retriever

My personal RAG had retrieval near its ceiling but was still giving truncated, sometimes-hallucinated answers. So I stopped tuning the retriever and rewrote the corpus itself: raw transcripts distilled into atomic, dated, self-contained claims. Answer quality went up with statistical significance on two separate corpora, and the fix removed hallucination instead of adding it.

In-App Purchases on App Store Connect: The Gotchas Nobody Warns You About

I set up In-App Purchases for a real App Store app for the first time and burned about 20 hours on failures that had nothing to do with my code. Five concrete traps — an agreement that looks signed but isn't, a save button that silently discards your work, staggered sandbox propagation, a subscription you can't test because you already own the product, and an icon rejected for one invisible pixel property.

BookTS: What Happened When an iOS App Went Through 100% of an Agentic Development Life Cycle

I built a 9-phase Agentic Development Life Cycle (ADLC) as reusable execution scaffolding, then ran a real App Store product — BookTS, an on-device EPUB-to-audiobook app — through all of it. Two of the phases caught bugs I would have shipped blind.

Building ADLC in Two Weeks: Using an AI Agent to Build the Harness It Runs Under

I paused a real App Store build to spend two weeks building the pipeline that would run it instead — a ten-phase Agentic Development Life Cycle, built by the same AI coding agent that would later execute it. What got reused versus built, the planning audit I overrode to own a review engine instead of just calling one, and the morning that engine caught a bug in its own predecessor's verdict.

ADLC: An Agentic Development Lifecycle for Solo, AI-Assisted Builders

Building alone with an AI coding agent removes every natural checkpoint a team gives you for free — no PM to catch scope creep, no reviewer to catch a dropped requirement. ADLC is the lifecycle apparatus I built to put those checkpoints back: ten phases, a gate discipline, and an autonomy contract that says exactly when the agent stops and when it doesn't.

I Built an AI That Audits My AI Infrastructure. Its Best Find Was My Own Blind Spots.

A week spent hardening fukuro — a personal auditor for AI infrastructure — turned into a lesson in recursion. I watched the auditor turn on itself, and its most useful output was a list of the boring, quiet mistakes that both I and the AI helping me had missed.

One Generator, a Mesh of Judges: An Idea-to-Ship Pipeline Built From AI Agents

Five AI agents wired into one idea-to-ship lane — one generator, four judges, and a runtime they share. The architecture, an end-to-end walkthrough, and the honest list of what it can't do.

Training a vision model to count objects in real time — running entirely in the browser

A PM journey: from 'the off-the-shelf API counts wrong' → fine-tuning YOLO → real-time on-device inference with WebGPU. Product decisions for building computer vision, not a code tutorial.

I Spun Up ~30 AI Agents in Parallel to Kill My Own Product Idea

Before building 'Jira for AI agents,' I ran one Claude prompt that fanned out into 111 sub-agent runs — roughly twenty minutes and 3.6 million tokens of work — across five research angles searched in parallel, 29 sources pulled, 133 claims extracted, then three independent skeptics voted on every claim. Only 9 survived. The verdict killed the idea in minutes. A PM's field notes on using AI to disprove yourself, cheaply.

Harness Engineering: The Scaffolding Is the Product, Not the Model

I ran two AI coding agents — Claude Code and Antigravity — through 100 turns of pair-programming over a Redis Streams channel I built, while they wrote real code against a live Redis. The code wasn't the lesson. What kept them working together was the harness. Field notes on engineering the scaffolding around an LLM.

AI-Canon-Crawler: why I gave my RAG a separate workspace for vendor truth

Personal-RAG kept hallucinating vendor specs because Marc's notes and Anthropic's docs lived in the same index. The fix: a dedicated canon workspace, a crawler that only ingests authoritative sources, and a routing rule that prefers canon when the question is about pricing, model size, or API behavior.

Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3

A real production use case: building a cross-source contradiction detector on Personal-RAG. Verifying ground truth with Eval-Framework and swapping Haiku → Grok 4.3 took real accuracy from 80% to 99%.

Mac-Translator: a Force Click translator that respects the flow

Building a menubar translator triggered by Force Click anywhere on macOS. Why the input gesture matters more than the model, and the AX/permission rebuild trap that cost me two evenings.

LLM bake-off across 6 judges: methodology + 7 surprising insights

Tested Claude (Haiku/Sonnet/Opus) vs Grok 4.3 vs GPT-5.4-mini vs Gemini Flash. N=30 dev + N=99 holdout. Stratified eval. Pairwise Cohen's kappa. Surprising winner.

7 decisions an AI assistant almost got wrong on an Eval-Framework SMB project

Field notes through a PM lens: building Eval-Framework for an SMB use case (~15k audits/mo). The AI engineering pair proposed 7 decisions that could have burned budget, breached compliance, and shipped the wrong roadmap. PM catches + reframes. Counter-prompt playbook below.

Eval-Framework: why PMs need to measure before picking an LLM

Building a personal eval framework: from a 20-case smoke test → 192 stratified cases → a bake-off across 6 models. When 'guessing' which LLM is best gets expensive and wrong.

Diagram-Engine: text-to-diagram designed for AI agents, not humans

Why Mermaid/PlantUML break down when an LLM is the author. JTBD, syntax design constraints, and 37 layout rules that made the output good enough for a public playground.

Why your AI memory lies to you — and how I built a 3-layer Knowledge-Audit to catch it

Claude Code, Cursor, and ChatGPT all ship 'persistent memory'. Six months in, the AI confidently asserts facts that are no longer true. I built a 3-layer audit plus a 4-tier cron to verify every claim.

Mail-Assistant: how I cut inbox triage from 90 min to 8 min with an ADHD-first AI agent

Field note: building a personal email triage agent that surfaces only what actually needs action. 568 messages → 7 P0/P1 rows. The 5 product decisions that killed 87% of the noise.

An AI sleep coach over Telegram: PM journey + 5 design principles for ADHD

I've worn a Garmin for 2 years — sleep, HRV, RHR data all there but zero action. This post logs my build through a PM lens: JTBD, existing alternatives, MVP scope decisions, ADHD-first design principles, and a V2 roadmap for a personalized AI sleep coach over Telegram.

Personal RAG KB: A 4-Version Journey from Cloud ADB to Local Mac M2 Max

Why I moved from Oracle ADB cloud → pgvector replica → local Mac — and what I learned measuring latency, quality, and cost at each version.

Building a Mac Voice Assistant and the Painful Lesson About Local LLMs

I built an AI voice assistant for daily use to truly understand the local vs cloud LLM trade-off. The surprising lesson: for agentic products, cloud at $0.001/turn is cheaper than 'free' local once you count true TCO. A decision framework for product teams and CTOs.

5 Innate Biases of Cloud LLMs & 5 Levers to Overcome Them

Why Claude / GPT / Gemini get it wrong so often when writing workflow docs, PRDs, BRDs — and why you can't 'fix' the weights. You can only build guardrails around them.

The 15-Minute Deep Dive Summary: The Ultimate Prompt for NotebookLM

Turn any document into a deep, vivid 15-minute Podcast Audio Overview entirely in Vietnamese with this prompt.

Second Brain & LLM OS Model - Andrej Karpathy

Interactive page visualizing Andrej Karpathy's philosophy for building an LLM Operating System and RAG architecture.

Prompt Engineering 101 — The Art of Talking to AI

A roundup of fundamental and advanced Prompt Engineering techniques for interacting effectively with LLMs.

Welcome to Vuihoc.AI

First post — Introducing Vuihoc.AI and why I started journaling my AI research journey.