Building a Mac Voice Assistant and the Painful Lesson About Local LLMs

TL;DR for product teams/CTOs: 73% of AI agent products launched in 2025-2026 default to local (open-source) LLMs because they’re “free” and “private.” After building a voice agent myself, I realized that for agentic flows (multi-step tool calling), local 7-13B models aren’t capable enough. Failed tool calls erode user trust and drive up retry costs, so real-world TCO ends up higher than a cloud frontier model at $0.001/turn. This post covers the journey plus a decision framework for teams planning their 2026 AI feature roadmap.

Why a PM bothered coding a voice assistant

I’m a Product Manager, not an engineer. My day job is reviewing AI product roadmaps, evaluating vendors (Anthropic, OpenAI, Mistral, self-hosted), and debating “build vs buy”, “local vs cloud” with engineering teams.

The problem: I couldn’t actually feel the capability gap between frontier LLMs and open-source models. Reading MMLU/GPQA benchmarks doesn’t help when a real user just wants the AI to “open Excel and paste this table.”

One weekend, I decided to build a native macOS voice assistant from scratch, in the style of Iron Man’s Jarvis. The goal wasn’t to ship it but to experience, decision by decision, every trade-off a product team has to make when building agentic AI.

Output:

Native macOS app (~50MB)
End-to-end voice loop: speak → STT → LLM → TTS playback
40 tools: open apps, type text, mouse clicks, browser tab control (Safari + Chrome), personal KB search, weather, maps
Dictation hotkey replacing Apple Dictation
Continuous mode + interrupt when you say “stop”

4 product decisions every enterprise will face

1. Cost framework: $/turn is not the only metric

I tested 3 LLM backends for the voice agent. User request: “Open TextEdit and type Hello”. The LLM has to call 2 tools correctly: open_app("TextEdit") then type_text("Hello").

Backend	$/turn	Latency	Tool-call accuracy	User experience
Qwen 2.5:7b local	$0	~3s	30%	Says “Typed Hello” without actually typing
Hermes 3:8b local (function-call tuned)	$0	~4s	60%	Opens app fine, typing fails 40%
Claude Haiku 4.5 cloud	$0.001	~1.5s	~100%	Correct 5 times in a row
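
To make the “tool-call accuracy” column concrete, here is a minimal sketch of how such a score could be computed. It is not the harness I actually used: the `ToolCall` type, the exact-match rule, and the sample trial log are all assumptions for illustration. Only the expected sequence (open_app, then type_text) comes from the test above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation emitted by the model (hypothetical shape)."""
    name: str
    argument: str


# Expected sequence for "Open TextEdit and type Hello".
EXPECTED = [ToolCall("open_app", "TextEdit"), ToolCall("type_text", "Hello")]


def is_correct(actual: list[ToolCall], expected: list[ToolCall] = EXPECTED) -> bool:
    """A trial passes only if the calls match exactly, in order."""
    return actual == expected


def tool_call_accuracy(trials: list[list[ToolCall]]) -> float:
    """Fraction of trials whose tool-call sequence matched exactly."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(is_correct(t) for t in trials) / len(trials)


# Made-up trial log: one good run, one run that skipped typing, and one
# run with no tool calls at all (the "says it typed but didn't" failure).
trials = [
    EXPECTED,
    [ToolCall("open_app", "TextEdit")],
    [],
]
print(f"Tool-call accuracy: {tool_call_accuracy(trials):.0%}")
```

Exact matching is deliberately strict: a model that answers “Typed Hello” without emitting the type_text call counts as a failure, which is the failure mode that hurts users most.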

Naive cost analysis: local is free, cloud is about $1/user/month (30 turns/day × 30 days × $0.001/turn ≈ $0.90). Local wins?

Real enterprise-product TCO analysis:

Cost layer	Local	Cloud
LLM API	$0	$1/user/month
Compute (Ollama daemon, GPU)	~$5/user/month (electricity, hardware amortization)	$0
Retry cost (user has to repeat 2-3 times on failure)	~3× user time per task	Near 0
Trust erosion (user loses faith → churn)	High	Low
Engineer debug time	~3 hours spent debugging code when the real problem was the model not calling the tool	0
Ship velocity (special prompt-engineering for local)	~2 weeks slower	0

Real TCO: by my estimate, cloud is 5-10× cheaper for agentic workloads. “Free” local is an illusion.
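
Two of the numbers above can be sanity-checked with simple arithmetic. The sketch below is a rough model, not a full TCO calculator. The accuracy figures come from the backend table (with “~100%” rounded to 1.0), while the function names and the assumption that every attempt succeeds independently with the same probability are mine.

```python
def monthly_api_cost(turns_per_day: float, cost_per_turn: float, days: int = 30) -> float:
    """API spend per user per month."""
    return turns_per_day * cost_per_turn * days


def expected_attempts(success_rate: float) -> float:
    """Mean number of tries until the first success.

    Assumes each attempt succeeds independently with the same probability
    (a geometric distribution), which is a simplification.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


# Accuracy per backend, taken from the table above.
backends = {
    "Qwen 2.5:7b local": 0.30,
    "Hermes 3:8b local": 0.60,
    "Claude Haiku 4.5 cloud": 1.00,
}

print(f"Cloud API cost: ${monthly_api_cost(30, 0.001):.2f}/user/month")
for name, accuracy in backends.items():
    print(f"{name}: ~{expected_attempts(accuracy):.1f} attempts per successful task")
```

Under these assumptions a 30%-accurate model needs about 3.3 attempts per successful task, which lines up with the “repeat 2-3 times” experience. The model leaves out trust erosion, debug time, and ship velocity, which are harder to put a number on.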

Lesson for CFOs/PMs: when presenting AI feature costs, use a TCO framework, not just “API cost.” A frontier LLM 100-1000× more expensive in raw cost can still be a bargain if accuracy hits 100%.

2. Privacy isn’t binary — architect it in layers

The usual reason for picking local LLM: “user data privacy.” I used to think it had to be all-local.

After building it, my Voice-Assistant architecture looks like this:

Audio capture: 100% on-device (AVAudioEngine local)
Speech-to-text: 100% on-device (Apple SFSpeechRecognizer)
LLM reasoning: Cloud Haiku — only sends text transcripts (no audio, no KB content)
Tool execution: 100% on-device (open apps, mouse clicks — never leave the Mac)
KB search: 100% local (RAG server on the Mac)
TTS playback: 100% on-device (AVSpeechSynthesizer)

The cloud sees exactly one thing: the text of what the user said in the session. As I read Anthropic’s API terms, API data isn’t logged for training. Compared with sending raw audio to the OpenAI Whisper API, the privacy footprint is far smaller.

Lesson for compliance/legal: privacy is a layered decision, not all-or-nothing. Document every data flow → present “what leaves device, what doesn’t” → compliance teams usually approve selective Cloud use.
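
One way to “document every data flow” is to keep the pipeline as a small machine-readable manifest and generate the “what leaves the device” summary from it. The sketch below mirrors the six stages listed above. The `Stage` type, its field names, and the consistency check are hypothetical, not part of the actual app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the voice pipeline (hypothetical manifest entry)."""
    name: str
    runs_on: str                       # "device" or "cloud"
    sent_off_device: tuple[str, ...] = ()


PIPELINE = [
    Stage("Audio capture", "device"),
    Stage("Speech-to-text", "device"),
    Stage("LLM reasoning", "cloud", ("text transcript",)),
    Stage("Tool execution", "device"),
    Stage("KB search", "device"),
    Stage("TTS playback", "device"),
]


def validate(pipeline: list[Stage]) -> None:
    """Reject manifests where location and declared data flow disagree."""
    for stage in pipeline:
        if stage.runs_on not in ("device", "cloud"):
            raise ValueError(f"{stage.name}: unknown location {stage.runs_on!r}")
        if stage.runs_on == "cloud" and not stage.sent_off_device:
            raise ValueError(f"{stage.name}: cloud stage must declare what it sends")
        if stage.runs_on == "device" and stage.sent_off_device:
            raise ValueError(f"{stage.name}: on-device stage declares outbound data")


def leaves_device(pipeline: list[Stage]) -> dict[str, tuple[str, ...]]:
    """Map each stage that sends data off the device to what it sends."""
    return {s.name: s.sent_off_device for s in pipeline if s.sent_off_device}


validate(PIPELINE)
for stage_name, data in leaves_device(PIPELINE).items():
    print(f"{stage_name} sends off-device: {', '.join(data)}")
```

A manifest like this gives compliance reviewers a single list to check, and the validation step flags a new cloud stage that does not declare what it sends.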

3. Wake word “Hey Voice-Assistant” or push-to-talk?

The Iron Man pattern: always-listening, “Hey Voice-Assistant” trigger. Looks sexy in demos.

Production reality:

Always-on mic = 15-25% daily battery drain
2-5 false triggers per day (any Siri user knows)
Enterprise privacy concern: mic always listening during meetings

For V1, I picked a push-to-talk hotkey (Cmd+Shift+J). Trade-off:

Lose the “Hey Voice-Assistant” magical moment
Win: no false triggers, no always-on battery drain, no always-listening privacy concern

Lesson for PMs: the hot pattern (always-on) isn’t automatically the right pattern. Test 3 cohorts:

Power-users (love the magic)
Casual users (hate false triggers)
Privacy-sensitive enterprise (require explicit trigger)

2/3 cohorts skew push-to-talk → ship V1 push-to-talk, wake word in V1.5.

4. Build vs buy — the 2026 stack is ready

Building a voice agent in 2024 took 6 months. In 2026 it takes one weekend. The stack is mature:

On-device STT: Apple SFSpeechRecognizer, whisper.cpp, WhisperKit (ANE), all free
LLM: Anthropic SDK, OpenAI SDK, Ollama — plug & play
Native TTS: AVSpeechSynthesizer + Premium voices — free
UI: SwiftUI + Metal — native, polished

Build now makes sense for:

Custom UX (an assistant in your brand’s style)
Stack lock-in control (no single-vendor dependency)
Enterprise data (don’t send mic/transcripts out)

Buy still makes sense for:

Generic productivity (Raycast AI, Apple Intelligence)
Voice for mobile/web (Vapi, Retell)
Time-to-market <2 weeks

Lesson for heads of product: “build a voice agent” is no longer a 6-month roadmap item. Re-evaluate every 6 months — the stack moves faster than project planning.

Summary decision framework

When your team plans an “AI agent feature” for a product, ask 4 questions:

Question	Skew Local	Skew Cloud
Tool calls need ≥90% accuracy?	No (chit-chat)	Yes (agentic action)
Latency < 2s critical?	No (background)	Yes (voice loop, real-time)
Sensitive data in input?	Yes (medical PHI, financial)	No (generic transcript)
User retry cost high? (re-do workflow)	No (idempotent)	Yes (sent email, money transfer)

If 3-4 answers skew Cloud → Cloud Haiku/Sonnet. If 3-4 skew Local → Mistral/Qwen. A 2-2 split → hybrid router (a lightweight classifier routes each request to the right tier).
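
The four questions can be written down as a small scoring function. This is a sketch of the rule of thumb above, not a validated model. The `FeatureProfile` type and field names are hypothetical, and each question carries equal weight, which is my simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureProfile:
    """Answers to the four framework questions for one AI feature."""
    needs_high_tool_accuracy: bool  # tool calls need >= 90% accuracy?
    latency_critical: bool          # latency < 2s critical?
    sensitive_input: bool           # medical PHI, financial data, etc.?
    retry_cost_high: bool           # is a failed or repeated action expensive?


def recommend(profile: FeatureProfile) -> str:
    """Return "cloud", "local", or "hybrid" by counting cloud-leaning answers."""
    cloud_votes = sum([
        profile.needs_high_tool_accuracy,
        profile.latency_critical,
        not profile.sensitive_input,   # sensitive input is the one local-leaning "yes"
        profile.retry_cost_high,
    ])
    if cloud_votes >= 3:
        return "cloud"
    if cloud_votes <= 1:
        return "local"
    return "hybrid"  # 2-2 split


# The voice assistant from this post: agentic, real-time, generic transcripts.
voice_assistant = FeatureProfile(
    needs_high_tool_accuracy=True,
    latency_critical=True,
    sensitive_input=False,
    retry_cost_high=False,
)
print(recommend(voice_assistant))
```

In practice a team might weight sensitive data more heavily than the other three questions. Equal weighting just keeps the sketch faithful to the table.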

What I’ll do next

Hybrid intent router: classify “tool-heavy” vs “chit-chat” → route to the right tier. Target: save ~80% of cost while keeping UX fast.
Cost dashboard for users: show the real cost of each query type → educate users about the trade-off instead of hiding it.
Vietnamese voice cloning + multilingual support: shippable for Vietnamese enterprise teams.
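
The hybrid intent router in the first item could start as something as simple as the sketch below. This is a stand-in, not the planned implementation: the keyword list and function names are made up for illustration, and a real router would more likely use a small trained classifier than keyword matching.

```python
import re

# Hypothetical action verbs that suggest the request needs tool calls.
ACTION_VERBS = {"open", "type", "click", "close", "search", "send", "paste", "switch"}


def classify(transcript: str) -> str:
    """Label a transcript "tool-heavy" if it contains any action verb."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return "tool-heavy" if words & ACTION_VERBS else "chit-chat"


def route(transcript: str) -> str:
    """Send tool-heavy requests to the cloud tier, everything else to local."""
    tiers = {"tool-heavy": "cloud", "chit-chat": "local"}
    return tiers[classify(transcript)]


for text in ["Open TextEdit and type Hello", "How are you today?"]:
    print(f"{text!r} -> {route(text)}")
```

A keyword router will misroute requests phrased without an obvious verb, so misclassification should fall back to the cloud tier when in doubt, given how expensive failed tool calls turned out to be.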

Takeaways for product/business leaders

Build / pair-build something once to feel the limits of open-source LLMs. Base your roadmap on that hands-on feel, not only on benchmark papers.
TCO framework: $/turn + retry-cost + debug-time + trust-erosion + ship-velocity. Frontier LLMs usually win the cost battle once you count everything.
Privacy is layered, not binary — document data flows, present to compliance, don’t drop Cloud just out of “privacy worry.”
Open-source LLMs are catching up fast, but for agentic flows (multi-step + tool reliability), frontier models (Claude, GPT-4) still lead in mid-2026. Re-evaluate in Q3.
Stack maturity has hit a tipping point. “Voice AI agent” is no longer a 6-month roadmap — a one-weekend POC is enough for product teams to validate an idea before committing headcount.

I’m a PM writing about AI products, agentic systems, and build vs buy decision frameworks. Comment or DM if your team is planning an AI feature roadmap and wants to deep-dive.