TL;DR for product teams/CTOs: 73% of AI agent products launched in 2025-2026 default to local (open-source) LLMs because they’re “free” and “private.” After building a voice agent myself, I realized that for agentic flows (multi-step tool calling), local 7-13B models aren’t reliable enough → user trust erodes → retry costs climb → real-world TCO ends up higher than a Cloud frontier model at $0.001/turn. This post covers the journey plus a decision framework for teams planning their 2026 AI feature roadmap.
Why a PM bothered coding a voice assistant
I’m a Product Manager, not an engineer. My day job is reviewing AI product roadmaps, evaluating vendors (Anthropic, OpenAI, Mistral, self-hosted), and debating “build vs buy”, “local vs cloud” with engineering teams.
The problem: I couldn’t actually feel the capability gap between frontier LLMs and open-source models. Reading MMLU/GPQA benchmarks doesn’t help when a real user just wants the AI to “open Excel and paste this table.”
So one weekend I decided to build a native macOS voice assistant from scratch, in the style of Iron Man’s Jarvis. The goal wasn’t to ship it, but to experience, decision by decision, every trade-off a product team has to make when building agentic AI.
Output:
- Native macOS app (~50MB)
- End-to-end voice loop: speak → STT → LLM → TTS playback
- 40 tools: open apps, type text, mouse clicks, browser tab control (Safari + Chrome), personal KB search, weather, maps
- Dictation hotkey replacing Apple Dictation
- Continuous mode + interrupt when you say “stop”
4 product decisions every enterprise will face
1. Cost framework: $/turn is not the only metric
I tested 3 LLM backends for the voice agent. User request: “Open TextEdit and type Hello”. The LLM has to call 2 tools correctly: open_app("TextEdit") then type_text("Hello").
| Backend | $/turn | Latency | Tool-call accuracy | User experience |
|---|---|---|---|---|
| Qwen 2.5:7b local | $0 | ~3s | 30% | Says “Typed Hello” without actually typing |
| Hermes 3:8b local (function-call tuned) | $0 | ~4s | 60% | Opens app fine, typing fails 40% |
| Claude Haiku 4.5 cloud | $0.001 | ~1.5s | ~100% | Correct 5 times in a row |
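To make “tool-call accuracy” concrete, here is a minimal sketch of how one trial could be scored: a trial passes only when the model emits exactly the expected tool calls, in order. This is not the harness behind the numbers above; `ToolCall`, `score_trial`, `accuracy`, and the sample traces are hypothetical names and data used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation emitted by the model (hypothetical shape)."""
    name: str
    argument: str


# Expected calls for the request "Open TextEdit and type Hello".
EXPECTED = [ToolCall("open_app", "TextEdit"), ToolCall("type_text", "Hello")]


def score_trial(actual_calls, expected=EXPECTED):
    """Pass only if the emitted calls match the expected sequence exactly."""
    return list(actual_calls) == list(expected)


def accuracy(trials, expected=EXPECTED):
    """Fraction of trials whose tool calls matched the expected sequence."""
    if not trials:
        return 0.0
    passed = sum(score_trial(calls, expected) for calls in trials)
    return passed / len(trials)


# Made-up traces covering the failure modes described in the table.
trials = [
    EXPECTED,                                # both calls correct
    [ToolCall("open_app", "TextEdit")],      # claimed to type, never called type_text
    [],                                      # answered in prose, no tool calls
    [ToolCall("type_text", "Hello"),
     ToolCall("open_app", "TextEdit")],      # right tools, wrong order
]

print(f"tool-call accuracy: {accuracy(trials):.0%}")
```

Scoring the emitted calls rather than the spoken reply matters here: a model that says “Typed Hello” without calling `type_text` would pass a text-only check and fail this one.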
Naive cost analysis: Local is free, Cloud is about $1/user/month ($0.001/turn × 30 turns/day × 30 days ≈ $0.90). Local wins?
Real enterprise-product TCO analysis:
| Cost layer | Local | Cloud |
|---|---|---|
| LLM API | $0 | $1/user/month |
| Compute (Ollama daemon, GPU) | ~$5/user/month (electricity, hardware amortization) | $0 |
| Retry cost (user has to repeat 2-3 times on failure) | ~3× time waste | near 0 |
| Trust erosion (user loses faith → churn) | High | Low |
| Engineer debug time | ~3 hours debugging code just because the model wouldn’t call the tool | 0 |
| Ship velocity (special prompt-engineering for local) | ~2 weeks slower | No delay |
Real TCO: by my estimate, Cloud is actually 5-10× cheaper for agentic workloads. “Free” local is an illusion.
Lesson for CFOs/PMs: when presenting AI feature costs, use a TCO framework, not just “API cost.” A frontier LLM that is 100-1000× more expensive in raw cost can still be a bargain if its tool-call accuracy approaches 100%.
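One way to put numbers on the retry layer is cost per successful task. The sketch below assumes retries are independent, so the expected number of attempts is 1 / success rate. The $30/hour value of user time is an illustrative assumption, and the per-attempt seconds reuse only the latency column from the benchmark table. Hardware, debug time, and churn are deliberately left out, so treat the output as a lower bound on the gap, not a full TCO.

```python
def naive_monthly_api_cost(cost_per_turn, turns_per_day, days=30):
    """API-only view: what the 'naive' analysis counts."""
    return cost_per_turn * turns_per_day * days


def cost_per_successful_task(api_cost_per_attempt, success_rate,
                             seconds_per_attempt, user_hourly_value):
    """Expected cost of getting one task done, including user time on retries.

    Assumes independent attempts, so expected attempts = 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    time_cost = seconds_per_attempt / 3600 * user_hourly_value
    expected_attempts = 1 / success_rate
    return (api_cost_per_attempt + time_cost) * expected_attempts


USER_HOURLY_VALUE = 30.0  # assumption: illustrative value of one hour of user time

# Success rates and latencies taken from the benchmark table above.
local = cost_per_successful_task(0.0, 0.30, 3.0, USER_HOURLY_VALUE)
cloud = cost_per_successful_task(0.001, 1.00, 1.5, USER_HOURLY_VALUE)

print(f"naive cloud API cost per month: ${naive_monthly_api_cost(0.001, 30):.2f}")
print(f"local, per successful task:     ${local:.4f}")
print(f"cloud, per successful task:     ${cloud:.4f}")
```

The design choice is to divide by success rate instead of comparing $/turn directly: a free attempt that fails most of the time still costs the user several attempts' worth of waiting.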
2. Privacy isn’t binary — architect it in layers
The usual reason for picking local LLM: “user data privacy.” I used to think it had to be all-local.
After building, my Voice-Assistant architecture:
- Audio capture: 100% on-device (AVAudioEngine local)
- Speech-to-text: 100% on-device (Apple SFSpeech)
- LLM reasoning: Cloud Haiku — only sends text transcripts (no audio, no KB content)
- Tool execution: 100% on-device (open apps, mouse clicks — never leave the Mac)
- KB search: 100% local (RAG server on the Mac)
- TTS playback: 100% on-device (AVSpeechSynthesizer)
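The Cloud sees exactly one thing: the text the user spoke in the session. Per Anthropic’s API terms, API inputs are not used for model training by default. Compared to “send raw audio to OpenAI Whisper API,” the privacy footprint is far smaller: text only, with no voice recording or background audio leaving the device.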
Lesson for compliance/legal: privacy is a layered decision, not all-or-nothing. Document every data flow → present “what leaves device, what doesn’t” → compliance teams usually approve selective Cloud use.
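The “what leaves device, what doesn’t” document can start as a small data-flow manifest. The sketch below encodes the six layers of the architecture described above; the `DataFlow` type, field names, and `egress_report` helper are hypothetical, and the data descriptions are my paraphrase of each layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """One stage of the pipeline and whether its data leaves the device."""
    stage: str
    data: str
    leaves_device: bool
    destination: str = "on-device"


# Mirrors the layered architecture listed above.
FLOWS = [
    DataFlow("Audio capture", "raw microphone audio", False),
    DataFlow("Speech-to-text", "audio to transcript", False),
    DataFlow("LLM reasoning", "text transcript", True, "cloud LLM API"),
    DataFlow("Tool execution", "app launches, keystrokes, clicks", False),
    DataFlow("KB search", "personal KB content", False),
    DataFlow("TTS playback", "response text to audio", False),
]


def egress_report(flows):
    """Return only the flows that send data off the device."""
    return [flow for flow in flows if flow.leaves_device]


for flow in FLOWS:
    marker = "LEAVES DEVICE" if flow.leaves_device else "stays local"
    print(f"{flow.stage:<15} {flow.data:<35} {marker} ({flow.destination})")

print("egress stages:", [flow.stage for flow in egress_report(FLOWS)])
```

Keeping the manifest in code means the compliance view can be regenerated whenever a stage changes, instead of drifting out of date in a slide deck.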
3. Wake word “Hey Voice-Assistant” or push-to-talk?
The Iron Man pattern: always-listening, “Hey Voice-Assistant” trigger. Looks sexy in demos.
Production reality:
- Always-on mic = 15-25% daily battery drain
- 2-5 false triggers per day (any Siri user knows)
- Enterprise privacy concern: mic always listening during meetings
For V1, I picked a push-to-talk hotkey (Cmd+Shift+J). Trade-off:
- Lose the “Hey Voice-Assistant” magical moment
- Win: no false triggers, no idle battery drain, no always-listening privacy concern
Lesson for PMs: the hot pattern (always-on) isn’t automatically the right pattern. Test 3 cohorts:
- Power-users (love the magic)
- Casual users (hate false triggers)
- Privacy-sensitive enterprise (require explicit trigger)
2/3 cohorts skew push-to-talk → ship V1 push-to-talk, wake word in V1.5.
4. Build vs buy — the 2026 stack is ready
Building a voice agent in 2024 = 6 months. In 2026 = one weekend. The stack is mature:
- On-device STT: Apple SFSpeech, Whisper.cpp, WhisperKit (ANE) — free
- LLM: Anthropic SDK, OpenAI SDK, Ollama — plug & play
- Native TTS: AVSpeechSynthesizer + Premium voices — free
- UI: SwiftUI + Metal — native, polished
Build now makes sense for:
- Custom UX (an assistant in your brand’s style)
- Stack lock-in control (no single-vendor dependency)
- Enterprise data (don’t send mic/transcripts out)
Buy still makes sense for:
- Generic productivity (Raycast AI, Apple Intelligence)
- Voice for mobile/web (Vapi, Retell)
- Time-to-market <2 weeks
Lesson for heads of product: “build a voice agent” is no longer a 6-month roadmap item. Re-evaluate every 6 months — the stack moves faster than project planning.
Summary decision framework
When your team plans an “AI agent feature” for a product, ask 4 questions:
| Question | Skew Local | Skew Cloud |
|---|---|---|
| Tool calls need ≥90% accuracy? | No (chit-chat) | Yes (agentic action) |
| Latency < 2s critical? | No (background) | Yes (voice loop, real-time) |
| Sensitive data in input? | Yes (medical PHI, financial) | No (generic transcript) |
| User retry cost high? (re-do workflow) | No (idempotent) | Yes (sent email, money transfer) |
3-4 questions skew Cloud → Cloud Haiku/Sonnet. 3-4 skew Local → Mistral/Qwen. Mixed → hybrid router (a lightweight classifier routes to the right tier).
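The scoring rule above is simple enough to write down as a function. In this sketch each question casts one vote; note that a “yes” on sensitive data is the only answer that votes Local. The function name `recommend` and its parameter names are hypothetical.

```python
def recommend(needs_accurate_tool_calls, latency_critical,
              sensitive_input, retry_cost_high):
    """Map the four yes/no answers to 'cloud', 'local', or 'hybrid'.

    3-4 Cloud votes -> cloud, 3-4 Local votes -> local, 2-2 split -> hybrid.
    """
    cloud_votes = sum([
        bool(needs_accurate_tool_calls),   # yes skews Cloud
        bool(latency_critical),            # yes skews Cloud
        not sensitive_input,               # yes skews Local, so invert
        bool(retry_cost_high),             # yes skews Cloud
    ])
    if cloud_votes >= 3:
        return "cloud"
    if cloud_votes <= 1:
        return "local"
    return "hybrid"


# Voice agent that clicks and types, with a generic transcript as input.
print(recommend(True, True, False, True))
# Background summarizer over medical notes, cheap to re-run.
print(recommend(False, False, True, False))
# Real-time agent handling sensitive input with cheap retries.
print(recommend(True, True, True, False))
```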
What I’ll do next
- Hybrid intent router: classify “tool-heavy” vs “chit-chat” → route to the right tier. Target: cut cost by ~80% while keeping UX fast.
- Cost dashboard for users: show users the real cost of each query type → educate them on the trade-off instead of hiding it.
- Vietnamese voice cloning + multilingual support: shippable for Vietnamese enterprise teams.
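The hybrid intent router in the first item could begin as something as small as the sketch below. It uses a keyword match as a stand-in for the lightweight classifier, which is an assumption on my part and not a built component; the verb list, `classify_intent`, and `route` are all hypothetical.

```python
import re

# Assumption: a hand-picked verb list stands in for a trained classifier.
ACTION_VERBS = {"open", "type", "click", "close", "switch", "paste", "send"}

TIER_BY_INTENT = {"tool-heavy": "cloud", "chit-chat": "local"}


def classify_intent(transcript):
    """Label a transcript 'tool-heavy' if it contains any action verb."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return "tool-heavy" if any(word in ACTION_VERBS for word in words) else "chit-chat"


def route(transcript):
    """Pick the model tier for one user turn."""
    return TIER_BY_INTENT[classify_intent(transcript)]


for utterance in ["Open TextEdit and type Hello", "How are you today?"]:
    print(f"{utterance!r} -> {route(utterance)}")
```

A keyword rule will misroute plenty of real requests (“what’s open on my calendar?”), so a production router would need evaluation on real transcripts before any savings estimate can be trusted.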
Takeaways for product/business leaders
- Build / pair-build something once to feel the limits of open-source LLMs. Base your roadmap on hands-on experience, not only on benchmark papers.
- TCO framework: $/turn + retry-cost + debug-time + trust-erosion + ship-velocity. Frontier LLMs usually win the cost battle once you count everything.
- Privacy is layered, not binary — document data flows, present to compliance, don’t drop Cloud just out of “privacy worry.”
- Open-source LLMs are catching up fast, but for agentic flows (multi-step + tool reliability), frontier Claude/GPT-4 models still lead in mid-2026. Re-evaluate in Q3.
- Stack maturity has hit a tipping point. “Voice AI agent” is no longer a 6-month roadmap — a one-weekend POC is enough for product teams to validate an idea before committing headcount.
I’m a PM writing about AI products, agentic systems, and build vs buy decision frameworks. Comment or DM if your team is planning an AI feature roadmap and wants to deep-dive.