TL;DR (SMB scope): an eval framework for an LLM-powered Knowledge-Audit (corpus 5k sources/mo, 15k audits/mo, budget ~$100/mo). Pairing with an AI assistant accelerated engineering, but the PM had to catch 7 decision biases that, if accepted blindly, would have wasted $200-1,200/mo, exposed compliance-sensitive details, and shipped the wrong roadmap. These are the field notes.
SMB context
Knowledge-Audit for an SMB team: detect contradictions across Confluence + Jira + Slack + email threads + product docs. Realistic volume: 15,000 audits/month. Budget approval: $100/mo. Compliance constraint: data residency (NDA + GDPR + VN DPD).
PM job: pick the LLM model + scorer + ship in sprint 1. AI assistant = engineering pair. Both fast, both useful — but decision quality needs PM oversight.
The 7 decisions the PM caught + reframed are recorded here for field reference.
Decision #1 — Tune the model before verifying the metric
AI proposed: 4 tuning paths to lift Grok 4.3 accuracy from 80% → 85-90%. Total budget impact $50-280/mo (self-consistency, ensemble, adaptive escalation, prompt v4).
PM challenge: “Which scorer measured the 80%? Verify the scorer is correct before tuning the model.”
Investigation: reran the 20 “failed” cases with an LLM-judge instead of strict substring match. 19/20 cases had valid alternate findings that the strict-match scorer was missing. Real accuracy = 99%, not 80%.
Risk if accepted blindly: $50-280/mo recurring cost for zero accuracy gain. Plus 4-6h of engineering pursuing the wrong direction.
Lesson: validate the measurement infrastructure BEFORE you accept the benchmark. The AI assistant doesn’t instinctively ask “maybe the scorer is wrong” — its bias is toward “the model needs tuning”. The PM asks: “what are we measuring? are we measuring it correctly? do those 5 ‘failed’ cases actually fail?”
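The rescoring check can be sketched roughly as follows. This is a minimal illustration, not the framework's actual scorer: it assumes a 100-case eval set (which is what 20 failures at 80% implies), and `judge` is a hypothetical callable standing in for the LLM-judge call.

```python
from typing import Callable, Iterable


def strict_match(expected: str, output: str) -> bool:
    """Strict scorer: pass only if the expected finding appears verbatim."""
    return expected.lower() in output.lower()


def rescore_failures(
    cases: Iterable[tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> tuple[int, int]:
    """Return (strict_passes, judge_rescued) for (expected, output) pairs.

    `judge` is a hypothetical stand-in for an LLM-judge call. It is only
    consulted on cases the strict scorer rejects.
    """
    strict_passes = 0
    rescued = 0
    for expected, output in cases:
        if strict_match(expected, output):
            strict_passes += 1
        elif judge(expected, output):
            rescued += 1
    return strict_passes, rescued


def corrected_accuracy(total: int, strict_passes: int, rescued: int) -> float:
    """Accuracy once judge-rescued cases are counted as passes."""
    return (strict_passes + rescued) / total


# With the numbers from this decision, assuming 100 cases in total:
# 80 strict passes + 19 rescued failures.
accuracy = corrected_accuracy(total=100, strict_passes=80, rescued=19)
```

The judge itself also deserves a manual spot check on a handful of cases, since a lenient judge can inflate accuracy just as a strict scorer deflates it.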
Decision #2 — Framing progress without a baseline
AI proposed: distilled adapter at iter 200 = 40% → iter 400 = 65%. Framing: “+25pp lift, healthy trajectory, continue training”.
PM challenge: “What does the base model alone (no adapter) score?”
Reality: base model = 70%. Adapter at 65% = 5pp worse than baseline. Training is HURTING the model, not helping.
Risk if accepted blindly: 8-12h more training + electricity + GPU thermal stress for negative ROI, and shipping a distilled model that underperforms the baseline.
Lesson: AI tends to frame relatively (“vs previous iteration”) and miss absolute comparisons (“vs a meaningful baseline”). PM forces the comparison: “vs no-treatment / vs default config / vs current production”. Always.
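A small sketch of the forced comparison, using the numbers from this decision. The function and field names are illustrative assumptions, not part of the original tooling.

```python
def lift_report(current: float, previous: float, baseline: float) -> dict[str, float]:
    """Report lift in percentage points against both reference points."""
    return {
        "vs_previous_pp": round(current - previous, 2),
        "vs_baseline_pp": round(current - baseline, 2),
    }


# Adapter at iter 400 = 65%, iter 200 = 40%, base model without adapter = 70%.
report = lift_report(current=65.0, previous=40.0, baseline=70.0)

# Gate the "continue training" decision on the baseline, not on the
# previous iteration.
should_continue = report["vs_baseline_pp"] > 0
```

Here the relative view shows +25pp while the baseline view shows -5pp, so the gate stays closed.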
Decision #3 — Refactor early when there’s no need yet
AI proposed: P0 — refactor the scripts into a generic Python package for reusability across future projects. Estimate 5h.
PM challenge: “How many real project consumers exist today? If 1, hold the refactor.”
Reality: this is the first project using the framework. Future projects (Mail-Assistant, Voice-Assistant) don’t exist yet. Refactoring up front = guessing the abstraction → high probability of getting the shape wrong.
Risk if accepted blindly: 5h of engineering for an abstraction layer that doesn’t fit when the second project arrives. Refactor-rewrite cost doubles.
Lesson: the AI training corpus leans heavily on “DRY/Abstract” SWE textbook patterns; the YAGNI counterweight is under-represented. PM rule: copy-paste 2 instances first → extract common patterns AFTER seeing the actual shared shape. Lazy abstraction > speculative abstraction.
Decision #4 — Scale compute with bad ROI
AI proposed: self-consistency N=3 uniformly to lift accuracy +5pp. Cost impact: 3× runtime → $60/mo → $180/mo (at SMB scale).
PM challenge: “+5pp for 3× cost = $24/percentage point. Is that ROI acceptable?”
Counter-design: adaptive escalation (N=2 default, N=3 only when the 2 calls disagree). Cost ~2.2× instead of 3×, same accuracy lift. Saves $30/mo.
Risk if accepted blindly: $30/mo waste compounding forever. Plus 24s latency/case → user-facing UX degradation.
Lesson: the AI default is “more compute = better”. The PM forces a Pareto check: “can a subset at 20% of the cost deliver 80% of the benefit?”. Apply this to EVERY proposal that involves scaling up.
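The adaptive escalation idea and the ROI arithmetic can be sketched as below. `ask` is a hypothetical zero-argument wrapper around one model call, and the cost model assumes spend scales linearly with the number of calls. Whether the accuracy lift matches uniform N=3 is something the eval set has to confirm.

```python
from typing import Callable


def adaptive_vote(ask: Callable[[], str]) -> tuple[str, int]:
    """Call the model twice; make a third call only when the two disagree.

    Returns (answer, number_of_calls_made).
    """
    first, second = ask(), ask()
    if first == second:
        return first, 2
    third = ask()
    if third in (first, second):
        # The third call sides with one of the first two: that is the majority.
        return third, 3
    # Three different answers: no majority, fall back to the first (arbitrary).
    return first, 3


def expected_calls(disagreement_rate: float) -> float:
    """Average calls per case: 2 always, plus 1 when the first two disagree."""
    return 2.0 + disagreement_rate


def cost_per_pp(base_cost: float, scaled_cost: float, lift_pp: float) -> float:
    """Extra monthly spend per percentage point of accuracy gained."""
    return (scaled_cost - base_cost) / lift_pp


# Uniform N=3 from this decision: $60/mo -> $180/mo for +5pp.
uniform_roi = cost_per_pp(base_cost=60.0, scaled_cost=180.0, lift_pp=5.0)
```

Under this linear model, a ~2.2× cost multiplier corresponds to the first two calls disagreeing on roughly 20% of cases, which is worth measuring on real traffic before committing.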
Decision #5 — Personal framing inside an enterprise blog
AI drafted content marketing pieces with:
- Hypothesis: “Build in one evening” (clichéd + personal scale)
- Cost: “$1.80/mo” (pocket money — doesn’t speak enterprise PM language)
- Build-vs-buy: “Personal KB privacy → can’t ship to vendor”
- Pain point: “Is switching Haiku → Grok worth it?” (model-specific tactical)
PM rewrites:
- Hypothesis → “Eval framework converts 3 strategic PM questions (Migration / Prompt engineering / Build-vs-Buy) from 2-3 weeks of guessing → a few hours of evidence-based answers”
- Cost → “At SMB scale 15k audits/mo: ~$90/mo build vs SaaS $500-2,000/mo, payback 1-2 months”
- Build-vs-buy → “Data residency (GDPR/NDA), heterogeneous source formats (Confluence/Jira/Slack/email), cost-quality frontier, customization velocity vs vendor roadmap”
- Pain points → strategic patterns (Migration A→B / re-prompt after swap / local vs cloud)
Risk if accepted blindly: marketing content lands flat with the enterprise PM audience. Personal anecdotes signal “personal project, not production-tested” → trust erosion.
Lesson: the AI defaults to “author-context framing” (the writer’s perspective). The PM explicitly sets the audience scope upfront: “write for an enterprise PM, no personal anecdotes, no pocket-money numbers, no model-specific tactical questions”. Output quality flips dramatically.
Decision #6 — PII and private references in public content
AI included in the blog drafts:
- Author real name “[name]” 7 times across 3 posts
- Private GitHub URL github.com/.../... (private repo, useless to a public reader)
- In-house codenames
PM catches: replace codenames with general-audience names (Knowledge-Audit, Personal-RAG, Mail-Assistant, Eval-Framework, Diagram-Engine, Mac-Translator, Voice-Assistant). Strip the personal name → first person. Drop the GitHub URL entirely.
Risk if accepted blindly:
- Compliance leak: internal codenames hint at architecture details that a competitor could fingerprint
- Trust signal: reader sees “private repo link” → “why mention it if I can’t access it?” → credibility drops
- SEO + share: codenames aren’t search-friendly → reduced organic reach
Lesson: the AI memorizes ALL session context including private identifiers. Default behavior = “use everything I know”. The PM is explicit: “public content — strip personal markers, rename codenames, no internal references”. Build a PII checklist into the publish workflow.
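One way to make the checklist mechanical is a pre-publish scan like the sketch below. The blocked terms are made-up placeholders (a real list would hold the actual author name and in-house codenames), and the URL pattern only covers GitHub links as an example.

```python
import re
from typing import Sequence

# Hypothetical placeholders, for illustration only.
BLOCKED_TERMS = ("Jane Example", "project-falcon")

# Matches github.com links with or without a scheme.
REPO_URL = re.compile(r"\b(?:https?://)?github\.com/\S+", re.IGNORECASE)


def publish_check(text: str, blocked_terms: Sequence[str] = BLOCKED_TERMS) -> list[str]:
    """Return human-readable findings; an empty list means the draft is clean."""
    findings = []
    for term in blocked_terms:
        count = len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
        if count:
            findings.append(f"blocked term '{term}' appears {count}x")
    for match in REPO_URL.finditer(text):
        findings.append(f"repository URL: {match.group(0)}")
    return findings
```

A scan like this only catches what is on the list, so it complements a human read of the draft rather than replacing it.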
Decision #7 — Generalize from a single verified task
AI proposed: switch all of the team’s LLM-powered features (Mail-Assistant, Voice-Assistant, Email-Filter) from Haiku → Grok 4.3 based on one verified result on the Knowledge-Audit task. Claimed cost saving: ~$200-400/mo across all features.
PM challenge: “Grok at 99% was verified on a single task domain (cross-source contradiction). Mail-Assistant = classification (a different task shape). Voice-Assistant = multi-step tool-use (different again). Each one needs its own eval set + bake-off.”
Counter-design: phased rollout. Eval per task. Switch task-by-task as evidence accumulates. Don’t blanket-migrate.
Risk if accepted blindly: production accuracy drops on the 2/3 features that weren’t verified. Silent user-facing degradation → support ticket spike → engineering revert → wasted month.
Lesson: the AI extrapolates from N=1 success. PM skepticism: “is the evidence base specific to this task domain? Does the task structure transfer?”. Generalization = hypothesis, not proof. Eval per domain = mandatory before any blanket commit.
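The phased rollout can be expressed as a per-task gate, sketched here. The `TaskEval` structure and the `min_margin_pp` threshold are assumptions for illustration; the point is only that a task with no eval result can never return "migrate".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskEval:
    task: str
    candidate_accuracy: Optional[float]  # None = no bake-off run yet
    incumbent_accuracy: Optional[float]  # None = no bake-off run yet


def migration_decision(ev: TaskEval, min_margin_pp: float = 0.0) -> str:
    """Decide per task; missing evidence always means hold."""
    if ev.candidate_accuracy is None or ev.incumbent_accuracy is None:
        return "hold: build eval set and run bake-off"
    if ev.candidate_accuracy - ev.incumbent_accuracy >= min_margin_pp:
        return "migrate"
    return "keep incumbent"


# A feature that has not been evaluated yet stays on hold.
pending = TaskEval(task="Mail-Assistant", candidate_accuracy=None, incumbent_accuracy=None)
decision = migration_decision(pending)
```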
Pattern summary
| # | AI bias direction | PM counter-prompt | Money/Time at risk |
|---|---|---|---|
| 1 | Tune model > Verify metric | “Check 5 failed cases manually: do they really fail?” | $50-280/mo + 4-6h |
| 2 | Frame relative > absolute | “vs baseline / no-treatment?” | 8-12h dev + electricity |
| 3 | DRY > YAGNI | “How many real instances? <2 = defer” | 5h + double rewrite cost |
| 4 | More compute > Pareto | “ROI per pp? 80/20 split?” | $30/mo recurring + UX latency |
| 5 | Author-context > audience-fit | “Set scope upfront, strip context” | Marketing credibility erosion |
| 6 | Use-all-context > scope-filter | “Public: strip PII + codenames” | Compliance + competitive intel |
| 7 | Generalize N=1 > per-domain eval | “Is the evidence specific to this task?” | Production accuracy drop |
PM take-aways
- Validate measurement BEFORE tuning the model. Scorer correctness > model accuracy. Verify with eyes on 5 cases.
- Anchor to a baseline, always. “vs no-treatment” = mandatory in any comparison.
- YAGNI > DRY at MVP phase. Lazy abstraction. Refactor late, not early.
- Cost-benefit per percentage point. Force explicit ROI math when scaling compute.
- Audience scope upfront. AI context default ≠ content scope.
- PII strip + codename rename = mandatory publish checklist.
- Eval per task domain. Generalization is a hypothesis, not proof.
Strategic note for SMB PMs
Pair-programming with an AI assistant is leverage that accelerates engineering velocity 3-5×. But decision quality doesn’t auto-scale alongside — PM oversight is the human-bound constraint.
The 7 biases above aren’t AI weaknesses — they’re a distribution mismatch between the training corpus (textbook SWE wisdom) and SMB production reality (tight budget, short time, scarce evidence). The PM role is pulling AI recommendations from textbook-default → SMB-grounded.
These patterns recur across SMB projects that pair-program with AI: tune-before-verify, miss baseline, refactor-early, bad-ROI scaling, audience-misframing, PII-leak, blanket-generalize. Documented = catchable. Catchable = avoidable.
Eval-Framework v1 shipped after the 7 catches: production deployed, spend held at ~$90/mo against the $100/mo budget, compliance preserved, roadmap intact.
Decision quality compounds. Engineering velocity matters less if the directional choices are wrong.