7 decisions an AI assistant almost got wrong on an Eval-Framework SMB project

Field notes through a PM lens: building Eval-Framework for an SMB use case (~15k audits/mo). The AI engineering pair proposed 7 decisions that could have burned budget, breached compliance, and shipped the wrong roadmap. Each entry records the PM’s catch and reframe; the counter-prompt playbook is at the end.

TL;DR: the scope is an eval framework for an LLM-powered Knowledge-Audit at SMB scale (corpus 5k sources/mo, 15k audits/mo, budget ~$100/mo). Pairing with an AI assistant accelerated engineering, but the PM had to catch 7 decision biases that, if accepted blindly, would have wasted $200-1,200/mo, breached compliance, and shipped the wrong roadmap. These are the field notes.

SMB context

Knowledge-Audit for an SMB team: detect contradictions across Confluence + Jira + Slack + email threads + product docs. Realistic volume: 15,000 audits/month. Budget approval: $100/mo. Compliance constraint: data residency (NDA + GDPR + VN DPD).

PM job: pick the LLM and the scorer, then ship in sprint 1. AI assistant = engineering pair. Both fast, both useful, but decision quality needs PM oversight.

The 7 decisions the PM caught + reframed are recorded here for field reference.

Decision #1 — Tune the model before verifying the metric

AI proposed: 4 tuning paths to lift Grok 4.3 accuracy from 80% → 85-90%. Total budget impact $50-280/mo (self-consistency, ensemble, adaptive escalation, prompt v4).

PM challenge: “Which scorer measured the 80%? Verify the scorer is correct before tuning the model.”

Investigation: rerun 20 “failed” cases with an LLM-judge instead of strict substring match. 19/20 cases had valid alternate findings that the strict-match scorer was missing. Counting those 19 as passes, real accuracy = 99%, not 80%.

Risk if accepted blindly: $50-280/mo recurring cost for zero accuracy gain. Plus 4-6h of engineering pursuing the wrong direction.

Lesson: validate the measurement infrastructure BEFORE you accept the benchmark. The AI assistant doesn’t instinctively ask “maybe the scorer is wrong” — its bias is toward “the model needs tuning”. The PM asks: “what are we measuring? are we measuring it correctly? do those 5 ‘failed’ cases actually fail?”
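The re-scoring check in Decision #1 could look roughly like the sketch below. It is an illustration, not the project's scorer: the case schema (`expected`, `output`), the function names, and `toy_judge` are all assumptions, and `toy_judge` is only a keyword stand-in for a real LLM-judge call.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]  # hypothetical schema: {"expected": ..., "output": ...}


def strict_match(expected: str, output: str) -> bool:
    """Pass only if the expected finding appears verbatim (case-insensitive)."""
    return expected.lower() in output.lower()


def rescore_failures(
    cases: List[Case], judge: Callable[[str, str], bool]
) -> Tuple[float, float, List[Case]]:
    """Re-check strict-match failures with a second-opinion judge.

    Returns (strict_accuracy, corrected_accuracy, overturned_cases).
    """
    if not cases:
        raise ValueError("need at least one case")
    failed = [c for c in cases if not strict_match(c["expected"], c["output"])]
    # A failure is overturned when the judge accepts an output the scorer rejected.
    overturned = [c for c in failed if judge(c["expected"], c["output"])]
    passed = len(cases) - len(failed)
    return passed / len(cases), (passed + len(overturned)) / len(cases), overturned


def toy_judge(expected: str, output: str) -> bool:
    """Stand-in for an LLM-judge: accept if every expected word appears, in any order."""
    return all(word in output.lower() for word in expected.lower().split())


# Made-up cases for illustration only.
cases = [
    {"expected": "retention is 30 days", "output": "Doc A says retention is 30 days."},
    {"expected": "retention is 30 days",
     "output": "Retention: 30 days is stated in Doc A, which conflicts with Doc B."},
    {"expected": "retention is 30 days", "output": "No contradiction found."},
]
strict, corrected, overturned = rescore_failures(cases, toy_judge)
print(f"strict={strict:.0%} corrected={corrected:.0%} overturned={len(overturned)}")
```

The list of overturned cases is the useful output: it is short enough to read by eye before anyone trusts either number.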

Decision #2 — Framing progress without a baseline

AI proposed: distilled adapter at iter 200 = 40% → iter 400 = 65%. Framing: “+25pp lift, healthy trajectory, continue training”.

PM challenge: “What does the base model alone (no adapter) score?”

Reality: base model = 70%. Adapter at 65% = 5pp worse than baseline. Training is HURTING the model, not helping.

Risk if accepted blindly: 8-12h more training, plus electricity and GPU thermal stress, for negative ROI. Worse, the team ships a distilled model that underperforms the baseline.

Lesson: AI tends to frame relatively (“vs previous iteration”) and miss absolute comparisons (“vs a meaningful baseline”). PM forces the comparison: “vs no-treatment / vs default config / vs current production”. Always.
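A minimal sketch of the baseline check from Decision #2, using the numbers quoted above. The function name and the tuple layout are assumptions made for illustration; scores are whole percentages so the deltas read directly as percentage points.

```python
from typing import Dict, List, Optional, Tuple


def compare_to_baseline(
    baseline_pct: int, checkpoints: Dict[int, int]
) -> List[Tuple[int, int, Optional[int], int]]:
    """Return (step, score, delta_vs_previous, delta_vs_baseline) per checkpoint."""
    rows = []
    previous = None
    for step in sorted(checkpoints):
        score = checkpoints[step]
        delta_prev = None if previous is None else score - previous
        rows.append((step, score, delta_prev, score - baseline_pct))
        previous = score
    return rows


# Numbers from the post: base model 70%, adapter 40% at iter 200 and 65% at iter 400.
rows = compare_to_baseline(70, {200: 40, 400: 65})
for step, score, d_prev, d_base in rows:
    verdict = "beats baseline" if d_base > 0 else "does not beat baseline"
    print(f"iter {step}: {score}% (vs previous: {d_prev}, vs baseline: {d_base:+d}pp) -> {verdict}")
```

Printing both deltas side by side makes the mismatch visible: the iter 400 row shows +25 against the previous checkpoint and -5pp against the base model.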

Decision #3 — Refactor early when there’s no need yet

AI proposed: P0 — refactor the scripts into a generic Python package for reusability across future projects. Estimate 5h.

PM challenge: “How many real project consumers exist today? If 1, hold the refactor.”

Reality: this is the first project using the framework. Future projects (Mail-Assistant, Voice-Assistant) don’t exist yet. Refactoring up front = guessing the abstraction → high probability of getting the shape wrong.

Risk if accepted blindly: 5h of engineering for an abstraction layer that doesn’t fit when the second project arrives. Refactor-rewrite cost doubles.

Lesson: the AI training corpus leans heavily on “DRY/Abstract” SWE textbook patterns; the YAGNI counterweight is under-represented. PM rule: copy-paste 2 instances first → extract common patterns AFTER seeing the actual shared shape. Lazy abstraction > speculative abstraction.

Decision #4 — Scale compute with bad ROI

AI proposed: self-consistency N=3 uniformly to lift accuracy +5pp. Cost impact: 3× runtime → $60/mo → $180/mo (at SMB scale).

PM challenge: “+5pp for 3× cost = $24/percentage point. Is that ROI acceptable?”
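The arithmetic behind the $24 figure, written out as a small helper. The function name is an assumption; the inputs are the figures quoted above.

```python
def cost_per_point(base_cost: float, new_cost: float, lift_pp: float) -> float:
    """Extra monthly spend per percentage point of accuracy gained."""
    if lift_pp <= 0:
        raise ValueError("no accuracy lift: the extra spend buys nothing")
    return (new_cost - base_cost) / lift_pp


# Figures from the post: $60/mo baseline, 3x cost for uniform N=3, +5pp lift.
uniform = cost_per_point(base_cost=60, new_cost=3 * 60, lift_pp=5)
print(f"${uniform:.0f} per percentage point")  # $24 per percentage point
```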

Counter-design: adaptive escalation (N=2 default, N=3 only when the 2 calls disagree). Cost ~2.5× instead of 3× ($150/mo vs $180/mo), same accuracy lift. Saves $30/mo.
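The adaptive escalation idea can be sketched as below. This is an illustration under assumptions: `ask` is a hypothetical zero-argument callable wrapping one model call, answers are compared by plain string equality (real outputs would likely need normalizing first), and the fallback when all three answers differ is an arbitrary choice.

```python
from typing import Callable, Tuple


def adaptive_answer(ask: Callable[[], str]) -> Tuple[str, int]:
    """Ask twice; make a third call only when the first two answers disagree.

    Returns (answer, number_of_calls_used).
    """
    first, second = ask(), ask()
    if first == second:
        return first, 2
    third = ask()
    if third in (first, second):
        return third, 3  # the tie-breaker sided with one of the first two
    return first, 3  # no majority; falling back to the first answer is arbitrary


def expected_calls(p_disagree: float) -> float:
    """Average calls per case when a fraction p_disagree of cases escalate."""
    return 2 + p_disagree
```

The cost multiplier is driven entirely by the disagreement rate, so that rate is worth measuring on the eval set before committing to a budget number.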

Risk if accepted blindly: $30/mo waste compounding forever. Plus 24s latency/case → user-facing UX degradation.

Lesson: the AI default is “more compute = better”. The PM forces a Pareto check: “can we get 80% of the benefit for 20% of the cost?”. Apply this to EVERY proposal that involves scaling up.

Decision #5 — Personal framing inside an enterprise blog

AI drafted content marketing pieces with:

  • Hypothesis: “Build in one evening” (clichéd + personal scale)
  • Cost: “$1.80/mo” (pocket money — doesn’t speak enterprise PM language)
  • Build-vs-buy: “Personal KB privacy → can’t ship to vendor”
  • Pain point: “Is switching Haiku → Grok worth it?” (model-specific tactical)

PM rewrites:

  • Hypothesis → “Eval framework converts 3 strategic PM questions (Migration / Prompt engineering / Build-vs-Buy) from 2-3 weeks of guessing → a few hours of evidence-based answers”
  • Cost → “At SMB scale 15k audits/mo: ~$90/mo build vs SaaS $500-2,000/mo, payback 1-2 months”
  • Build-vs-buy → “Data residency (GDPR/NDA), heterogeneous source formats (Confluence/Jira/Slack/email), cost-quality frontier, customization velocity vs vendor roadmap”
  • Pain points → strategic patterns (Migration A→B / re-prompt after swap / local vs cloud)

Risk if accepted blindly: marketing content lands flat with the enterprise PM audience. Personal anecdotes signal “personal project, not production-tested” → trust erosion.

Lesson: the AI defaults to “author-context framing” (the writer’s perspective). The PM explicitly sets the audience scope upfront: “write for an enterprise PM, no personal anecdotes, no pocket-money numbers, no model-specific tactical questions”. Output quality flips dramatically.

Decision #6 — PII and private references in public content

AI included in the blog drafts:

  • Author real name “[name]” 7 times across 3 posts
  • Private GitHub URL github.com/.../... (private repo — useless to a public reader)
  • In-house codenames

PM catches: replace codenames with general-audience names (Knowledge-Audit, Personal-RAG, Mail-Assistant, Eval-Framework, Diagram-Engine, Mac-Translator, Voice-Assistant). Strip the personal name → first person. Drop the GitHub URL entirely.

Risk if accepted blindly:

  • Compliance leak: internal codenames hint at architecture details that a competitor could fingerprint
  • Trust signal: reader sees “private repo link” → “why mention it if I can’t access it?” → credibility drops
  • SEO + share: codenames aren’t search-friendly → reduced organic reach

Lesson: the AI memorizes ALL session context including private identifiers. Default behavior = “use everything I know”. The PM is explicit: “public content — strip personal markers, rename codenames, no internal references”. Build a PII checklist into the publish workflow.
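One way to make the publish checklist mechanical is a pre-publish scan like the sketch below. Everything in it is an assumption for illustration: the function name, the blocked-term list, the sample draft, and the URL pattern, which only covers GitHub links. A real checklist would carry the team's own names and codenames.

```python
import re
from typing import List, Tuple

# Matches GitHub links with or without a scheme (illustrative pattern only).
REPO_URL = re.compile(r"(?:https?://)?github\.com/\S+")


def scan_draft(text: str, blocked_terms: List[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, kind, match) for each private marker found in a draft."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for term in blocked_terms:
            # Whole-word, case-insensitive match on each blocked name or codename.
            if re.search(rf"\b{re.escape(term)}\b", line, flags=re.IGNORECASE):
                findings.append((lineno, "blocked-term", term))
        for match in REPO_URL.finditer(line):
            findings.append((lineno, "repo-url", match.group(0)))
    return findings


# Made-up draft and terms, not taken from the real posts.
draft = (
    "Written by Jane Example.\n"
    "Source: github.com/example-org/internal-tool\n"
    "Nothing private here."
)
findings = scan_draft(draft, blocked_terms=["Jane Example", "Project Falcon"])
for lineno, kind, match in findings:
    print(f"line {lineno}: {kind}: {match}")
```

A scan like this only catches what is on the list, so it supplements the manual read-through rather than replacing it.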

Decision #7 — Generalize from a single verified task

AI proposed: switch all of the team’s LLM-powered features (Mail-Assistant, Voice-Assistant, Email-Filter) from Haiku → Grok 4.3 based on one verified result on the Knowledge-Audit task. Claimed cost saving: ~$200-400/mo across all features.

PM challenge: “Grok at 99% was verified on a single task domain (cross-source contradiction). Mail-Assistant = classification (a different task shape). Voice-Assistant = multi-step tool-use (different again). Each one needs its own eval set + bake-off.”

Counter-design: phased rollout. Eval per task. Switch task-by-task as evidence accumulates. Don’t blanket-migrate.

Risk if accepted blindly: production accuracy drops on the features that were never evaluated. Silent user-facing degradation → support ticket spike → engineering revert → wasted month.

Lesson: the AI extrapolates from N=1 success. PM skepticism: “is the evidence base specific to this task domain? Does the task structure transfer?”. Generalization = hypothesis, not proof. Eval per domain = mandatory before any blanket commit.
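The phased rollout can be expressed as a per-feature gate, sketched below. The function name, the result layout, and above all the scores are placeholders invented for illustration; they are not results from this project.

```python
from typing import Dict, Optional, Tuple

# feature -> (current_model_pct, candidate_model_pct), or None when no eval set exists yet
EvalResults = Dict[str, Optional[Tuple[int, int]]]


def rollout_plan(results: EvalResults, min_gain_pp: int = 0) -> Dict[str, str]:
    """Decide per feature: switch only where a task-specific eval supports it."""
    plan = {}
    for feature, scores in results.items():
        if scores is None:
            plan[feature] = "needs eval"  # no evidence for this task yet
        else:
            current, candidate = scores
            plan[feature] = "switch" if candidate - current >= min_gain_pp else "hold"
    return plan


# Placeholder scores, NOT measurements from the post.
plan = rollout_plan(
    {
        "Knowledge-Audit": (95, 99),
        "Mail-Assistant": None,
        "Voice-Assistant": (90, 84),
    }
)
for feature, decision in plan.items():
    print(f"{feature}: {decision}")
```

The design choice is that a missing eval yields "needs eval" rather than defaulting to "switch", so a blanket migration cannot happen by omission.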

Pattern summary

| # | AI bias direction | PM counter-prompt | Money/Time at risk |
|---|---|---|---|
| 1 | Tune model > Verify metric | “5 failed cases manually: do they really fail?” | $50-280/mo + 4-6h |
| 2 | Frame relative > absolute | “vs baseline / no-treatment?” | 8-12h dev + electricity |
| 3 | DRY > YAGNI | “How many real instances? <2 = defer” | 5h + double rewrite cost |
| 4 | More compute > Pareto | “ROI per pp? 80/20 split?” | $30/mo recurring + UX latency |
| 5 | Author-context > audience-fit | “Set scope upfront, strip context” | Marketing credibility erosion |
| 6 | Use-all-context > scope-filter | “Public: strip PII + codenames” | Compliance + competitive intel |
| 7 | Generalize N=1 > per-domain eval | “Is the evidence specific to this task?” | Production accuracy drop |

PM take-aways

  1. Validate measurement BEFORE tuning the model. Scorer correctness > model accuracy. Verify with eyes on 5 cases.
  2. Anchor to a baseline, always. “vs no-treatment” = mandatory in any comparison.
  3. YAGNI > DRY at MVP phase. Lazy abstraction. Refactor late, not early.
  4. Cost-benefit per percentage point. Force explicit ROI math when scaling compute.
  5. Audience scope upfront. AI context default ≠ content scope.
  6. PII strip + codename rename = mandatory publish checklist.
  7. Eval per task domain. Generalization is a hypothesis, not proof.

Strategic note for SMB PMs

Pair-programming with an AI assistant is leverage that accelerates engineering velocity 3-5×. But decision quality doesn’t scale along with it: PM oversight remains the human-bound constraint.

The 7 biases above aren’t AI weaknesses — they’re a distribution mismatch between the training corpus (textbook SWE wisdom) and SMB production reality (tight budget, short time, scarce evidence). The PM role is pulling AI recommendations from textbook-default → SMB-grounded.

The pattern recurs across every SMB project pair-programming with AI: tune-before-verify, miss baseline, refactor-early, bad-ROI scaling, audience-misframing, PII-leak, blanket-generalize. Documented = catchable. Catchable = avoidable.

Eval-Framework v1 shipped after the 7 catches: production deployed, $90/mo budget held, compliance preserved, roadmap intact.

Decision quality compounds. Engineering velocity matters less if the directional choices are wrong.