Building a Mac Voice Assistant and the Painful Lesson About Local LLMs

I built an AI voice assistant for daily use to truly understand the local vs cloud LLM trade-off. The surprise lesson: for agentic products, $0.001/turn Cloud is cheaper than 'free' local — once you count true TCO. A decision framework for product teams and CTOs.

TL;DR for product teams/CTOs: 73% of AI agent products launched in 2025-2026 default to local (open-source) LLMs because they’re “free” and “private.” After building a voice agent myself, I realized that for agentic flows (multi-step tool calling), local 7-13B models aren’t reliable enough at tool calling → user trust erodes → retry costs pile up → real-world TCO ends up higher than a Cloud frontier model at $0.001/turn. This post covers the journey plus a decision framework for teams planning their 2026 AI feature roadmap.

Why a PM bothered coding a voice assistant

I’m a Product Manager, not an engineer. My day job is reviewing AI product roadmaps, evaluating vendors (Anthropic, OpenAI, Mistral, self-hosted), and debating “build vs buy”, “local vs cloud” with engineering teams.

The problem: I couldn’t actually feel the capability gap between frontier LLMs and open-source models. Reading MMLU/GPQA benchmarks doesn’t help when a real user just wants the AI to “open Excel and paste this table” on request.

So one weekend I decided to build a native macOS voice assistant from scratch, in the style of Iron Man’s Jarvis. The goal was not to ship it, but to experience, decision by decision, every trade-off a product team has to make when building agentic AI.

Output:

  • Native macOS app (~50MB)
  • End-to-end voice loop: speak → STT → LLM → TTS playback
  • 40 tools: open apps, type text, mouse clicks, browser tab control (Safari + Chrome), personal KB search, weather, maps
  • Dictation hotkey replacing Apple Dictation
  • Continuous mode + interrupt when you say “stop”

4 product decisions every enterprise will face

1. Cost framework: $/turn is not the only metric

I tested 3 LLM backends for the voice agent. User request: “Open TextEdit and type Hello”. The LLM has to call 2 tools correctly: open_app("TextEdit") then type_text("Hello").

| Backend | $/turn | Latency | Tool-call accuracy | User experience |
|---|---|---|---|---|
| Qwen 2.5:7b local | $0 | ~3s | 30% | Says “Typed Hello” without actually typing |
| Hermes 3:8b local (function-call tuned) | $0 | ~4s | 60% | Opens app fine, typing fails 40% |
| Claude Haiku 4.5 cloud | $0.001 | ~1.5s | ~100% | Correct 5 times in a row |
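
One way to turn "tool-call accuracy" into a number is to replay the same request several times and count the runs whose emitted tool calls match the expected sequence exactly. The sketch below shows that idea in Python. The `ToolCall` type, the `run_matches` and `accuracy` helpers, and the sample runs are all illustrative assumptions, not the harness behind the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    argument: str


# Expected sequence for the request "Open TextEdit and type Hello".
EXPECTED = (ToolCall("open_app", "TextEdit"), ToolCall("type_text", "Hello"))


def run_matches(observed, expected=EXPECTED):
    """A run passes only if it emits exactly the expected calls, in order."""
    return tuple(observed) == tuple(expected)


def accuracy(runs, expected=EXPECTED):
    """Fraction of runs whose tool calls match the expected sequence."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(run_matches(r, expected) for r in runs) / len(runs)


# Hypothetical transcripts (made up for illustration): one correct run,
# one that skips typing, and one that calls no tools at all.
sample_runs = [
    [ToolCall("open_app", "TextEdit"), ToolCall("type_text", "Hello")],
    [ToolCall("open_app", "TextEdit")],
    [],
]
print(f"accuracy: {accuracy(sample_runs):.0%}")
```

Exact-match scoring is deliberately strict: a model that claims "Typed Hello" without emitting `type_text` scores zero for that run, which is how the user experiences it.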

Naive cost analysis: Local is free, Cloud is about $1/user/month (30 turns/day at $0.001/turn). Local wins?
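
The monthly figure is just per-turn price times usage. A quick check in Python, where the 30-day month and the function name are my assumptions:

```python
PRICE_PER_TURN_USD = 0.001  # cloud price per turn quoted in this post
TURNS_PER_DAY = 30
DAYS_PER_MONTH = 30         # assumption: a 30-day month


def monthly_api_cost(price_per_turn, turns_per_day, days=DAYS_PER_MONTH):
    """Raw API spend per user per month, ignoring every other cost layer."""
    return price_per_turn * turns_per_day * days


cost = monthly_api_cost(PRICE_PER_TURN_USD, TURNS_PER_DAY)
print(f"${cost:.2f} per user per month")  # prints $0.90 per user per month
```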

Real enterprise-product TCO analysis:

| Cost layer | Local | Cloud |
|---|---|---|
| LLM API | $0 | $1/user/month |
| Compute (Ollama daemon, GPU) | ~$5/user/month (electricity, hardware amortization) | $0 |
| Retry cost (user has to repeat 2-3 times on failure) | ~3× time waste | near 0 |
| Trust erosion (user loses faith → churn) | High | Low |
| Engineer debug time | ~3 hours debugging code just because the model wouldn’t call the tool | 0 |
| Ship velocity (special prompt-engineering for local) | -2 weeks | 0 |

Real TCO: by my estimate, Cloud is actually 5-10× cheaper for agentic flows. “Free” local is an illusion.

Lesson for CFOs/PMs: when presenting AI feature costs, use a TCO framework, not just “API cost.” A frontier LLM that is 100-1000× more expensive in raw API cost can still be a bargain if its tool-call accuracy approaches 100%.
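
The retry row can be made concrete with a simple model: if each attempt succeeds independently with probability p, the expected number of attempts until success is 1/p. The sketch below applies that to the latency and accuracy figures from the backend comparison. The independence assumption and the function names are mine, and real retries are probably correlated (a model that fails once often fails again), so treat this as a lower bound on the pain.

```python
def expected_attempts(accuracy):
    """Mean attempts until first success for independent tries (geometric)."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return 1 / accuracy


def effective_latency_s(latency_s, accuracy):
    """Expected model time per completed task, counting failed attempts."""
    return latency_s * expected_attempts(accuracy)


# (latency in seconds, tool-call accuracy) as reported in this post.
backends = {
    "Qwen 2.5:7b local": (3.0, 0.30),
    "Hermes 3:8b local": (4.0, 0.60),
    "Claude Haiku 4.5 cloud": (1.5, 1.00),
}
for name, (latency_s, acc) in backends.items():
    print(f"{name}: {expected_attempts(acc):.1f} attempts, "
          f"{effective_latency_s(latency_s, acc):.1f}s per completed task")
```

Under this model the 30% backend needs about 3.3 attempts per task, which lines up with the "repeat 2-3 times" experience, and its effective latency grows from ~3s to roughly 10s.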

2. Privacy isn’t binary — architect it in layers

The usual reason for picking local LLM: “user data privacy.” I used to think it had to be all-local.

After building it, my voice assistant’s architecture looks like this:

  • Audio capture: 100% on-device (AVAudioEngine local)
  • Speech-to-text: 100% on-device (Apple SFSpeech)
  • LLM reasoning: Cloud Haiku — only sends text transcripts (no audio, no KB content)
  • Tool execution: 100% on-device (open apps, mouse clicks — never leave the Mac)
  • KB search: 100% local (RAG server on the Mac)
  • TTS playback: 100% on-device (AVSpeechSynthesizer)

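The Cloud sees exactly one thing: the text of what the user said in the session. Per Anthropic’s API terms, that text is not used for model training. Compared with sending raw audio to the OpenAI Whisper API, the privacy footprint is far smaller: a transcript carries no voice, background sound, or other audio detail.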

Lesson for compliance/legal: privacy is a layered decision, not all-or-nothing. Document every data flow → present “what leaves device, what doesn’t” → compliance teams usually approve selective Cloud use.
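
A "what leaves the device" document can also live next to the code as a small, checkable manifest. The sketch below encodes the six stages listed above. The `Stage` type, the data-type labels, and the `check_policy` helper are hypothetical names for illustration, not part of the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    runs_on: str                      # "device" or "cloud"
    data_sent_off_device: tuple = ()  # labels for data that leaves the Mac


PIPELINE = (
    Stage("Audio capture", "device"),
    Stage("Speech-to-text", "device"),
    Stage("LLM reasoning", "cloud", ("text transcript",)),
    Stage("Tool execution", "device"),
    Stage("KB search", "device"),
    Stage("TTS playback", "device"),
)


def leaves_device(pipeline):
    """Every (stage, data type) pair that crosses the device boundary."""
    return [(s.name, d) for s in pipeline for d in s.data_sent_off_device]


def check_policy(pipeline, forbidden=("raw audio", "KB content")):
    """Fail loudly if a forbidden data type is ever sent off-device."""
    violations = [(n, d) for n, d in leaves_device(pipeline) if d in forbidden]
    if violations:
        raise ValueError(f"policy violations: {violations}")


check_policy(PIPELINE)
for stage, data in leaves_device(PIPELINE):
    print(f"{stage} sends: {data}")
```

Running `check_policy` in CI is one way to make sure a later change (say, switching to a cloud STT service) cannot quietly widen the data flow that compliance approved.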

3. Wake word “Hey Voice-Assistant” or push-to-talk?

The Iron Man pattern: always-listening, “Hey Voice-Assistant” trigger. Looks sexy in demos.

Production reality:

  • Always-on mic = 15-25% daily battery drain
  • 2-5 false triggers per day (any Siri user knows)
  • Enterprise privacy concern: mic always listening during meetings

For V1, I picked a push-to-talk hotkey (Cmd+Shift+J). Trade-off:

  • Lose the “Hey Voice-Assistant” magical moment
  • Win: no false triggers, no always-on mic battery drain, no always-listening privacy concern

Lesson for PMs: hot patterns (always-on) aren’t automatically the right pattern. Test 3 cohorts:

  • Power-users (love the magic)
  • Casual users (hate false triggers)
  • Privacy-sensitive enterprise (require explicit trigger)

2 of 3 cohorts skew push-to-talk → ship V1 with push-to-talk, add the wake word in V1.5.

4. Build vs buy — the 2026 stack is ready

Building a voice agent in 2024 = 6 months. In 2026 = one weekend. The stack is mature:

  • On-device STT: Apple SFSpeech, Whisper.cpp, WhisperKit (ANE) — free
  • LLM: Anthropic SDK, OpenAI SDK, Ollama — plug & play
  • Native TTS: AVSpeechSynthesizer + Premium voices — free
  • UI: SwiftUI + Metal — native, polished

Build now makes sense for:

  • Custom UX (an assistant in your brand’s style)
  • Stack lock-in control (no single-vendor dependency)
  • Enterprise data (don’t send mic/transcripts out)

Buy still makes sense for:

  • Generic productivity (Raycast AI, Apple Intelligence)
  • Voice for mobile/web (Vapi, Retell)
  • Time-to-market <2 weeks

Lesson for heads of product: “build a voice agent” is no longer a 6-month roadmap item. Re-evaluate every 6 months — the stack moves faster than project planning.

Summary decision framework

When your team plans an “AI agent feature” for a product, ask 4 questions:

| Question | Skew Local | Skew Cloud |
|---|---|---|
| Tool calls need ≥90% accuracy? | No (chit-chat) | Yes (agentic action) |
| Latency < 2s critical? | No (background) | Yes (voice loop, real-time) |
| Sensitive data in input? | Yes (medical PHI, financial) | No (generic transcript) |
| User retry cost high? (re-do workflow) | No (idempotent) | Yes (sent email, money transfer) |

3-4 questions skew Cloud → Cloud Haiku/Sonnet. 3-4 skew Local → Mistral/Qwen. Mixed → hybrid router (a lightweight classifier routes to the right tier).
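
The four questions and the 3-of-4 rule can be written down as a tiny scoring function. A minimal sketch in Python; the question keys and the `recommend` function are names I made up for illustration.

```python
QUESTIONS = (
    # (question key, answer that skews Cloud)
    ("tool_calls_need_90pct_accuracy", True),
    ("latency_under_2s_critical", True),
    ("sensitive_data_in_input", False),  # sensitive data skews Local
    ("user_retry_cost_high", True),
)


def recommend(answers):
    """Map four yes/no answers to 'cloud', 'local', or 'hybrid'."""
    cloud_votes = sum(answers[key] == cloud_answer
                      for key, cloud_answer in QUESTIONS)
    local_votes = len(QUESTIONS) - cloud_votes
    if cloud_votes >= 3:
        return "cloud"
    if local_votes >= 3:
        return "local"
    return "hybrid"  # a 2-2 split: route per request instead


# The voice agent from this post: accurate tool calls, tight latency,
# generic transcripts, and costly retries.
voice_agent = {
    "tool_calls_need_90pct_accuracy": True,
    "latency_under_2s_critical": True,
    "sensitive_data_in_input": False,
    "user_retry_cost_high": True,
}
print(recommend(voice_agent))
```

With four binary questions, "mixed" can only mean a 2-2 split, so the hybrid branch is the tie-breaker rather than a common outcome.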

What I’ll do next

  1. Hybrid intent router: classify “tool-heavy” vs “chit-chat” → route to the right tier. Goal: cut cost by about 80% while keeping UX fast.
  2. Cost dashboard for users: show the real cost of each query type → educate users about the trade-off instead of hiding it.
  3. Vietnamese (VN) voice cloning + multilingual support: make it shippable for Vietnamese enterprise teams.
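
To show the shape of the hybrid intent router in item 1, here is a deliberately naive sketch in Python. It uses a keyword check as a stand-in for the lightweight classifier; the word list, the tier names, and both functions are assumptions for illustration, and a real router would need a trained classifier and evaluation on real transcripts.

```python
import re

# Hypothetical action verbs that suggest the request needs tool calls.
ACTION_WORDS = {"open", "type", "click", "close", "search", "paste", "send"}

# Tool-heavy requests go to the cloud tier, chit-chat stays local.
ROUTES = {"tool-heavy": "cloud", "chit-chat": "local"}


def classify(transcript):
    """Return 'tool-heavy' if the request contains an action verb."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return "tool-heavy" if words & ACTION_WORDS else "chit-chat"


def route(transcript):
    """Pick the LLM tier for a single spoken request."""
    return ROUTES[classify(transcript)]


for request in ("Open TextEdit and type Hello", "How are you today?"):
    print(f"{request!r} -> {route(request)}")
```

The design point is that the router sits in front of both tiers and costs almost nothing to run, so a misroute is the main risk to measure, not the router's own latency.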

Takeaways for product/business leaders

  1. Build / pair-build something once to feel the limits of open-source LLMs. Base your roadmap on hands-on feel, not on benchmark papers alone.
  2. TCO framework: $/turn + retry-cost + debug-time + trust-erosion + ship-velocity. Frontier LLMs usually win the cost battle once you count everything.
  3. Privacy is layered, not binary — document data flows, present to compliance, don’t drop Cloud just out of “privacy worry.”
  4. Open-source LLMs are catching up fast, but for agentic flows (multi-step + tool reliability), frontier models (Claude/GPT-4) still lead in mid-2026. Re-evaluate in Q3.
  5. Stack maturity has hit a tipping point. “Voice AI agent” is no longer a 6-month roadmap — a one-weekend POC is enough for product teams to validate an idea before committing headcount.

I’m a PM writing about AI products, agentic systems, and build vs buy decision frameworks. Comment or DM if your team is planning an AI feature roadmap and wants to deep-dive.