Scope of this post: Cloud-hosted LLMs (Claude, GPT, Gemini, hosted Mistral, etc.) — meaning you only have API access and cannot fine-tune the weights. If you’re running a Local LLM (Llama base, self-hosted Qwen base), you get an additional fine-tuning option that this post doesn’t cover. See the final section “How are Local LLMs different?”.
Context: 12 errors in one session
This week I had Claude Code draft a workflow doc for a client (a Google Sheet with columns including Module · Flow · Current workflow · Workflow V2 · Key rules). Sources: a Krisp meeting transcript, email, Confluence, and workshop notes.
110 turns later, I caught 12 errors in the cell content. Not typos. Structural errors:
- Took an internal team demo statement from the meeting → recorded it as the client describing the current workflow
- Converted a client requirement (“avoid X”) into a description of the Current state (“X is happening”)
- Auto-filled state machines, approval levels, pre-conditions even when the source said nothing about them
- Invented a “Pending Review status” for a feature the meeting notes never mentioned
- Misread one entity ↔ vendor mapping → propagated it across multiple unrelated cells
I asked Claude: “Why so many errors?” The answer was so honest that I’m writing this post to share it.
5 innate biases of Cloud LLMs
These aren’t accidents. They are default behaviors baked into the weights when the vendor trains the model (especially via RLHF / Constitutional AI). Cloud LLMs — Claude / GPT / Gemini accessed via API — share roughly the same bias profile because they share the same training paradigm.
1. Smooth-prose bias
LLMs are trained on data full of polished prose. “Pretty” output = reward. So the model automatically converts a raw quote "Yeah, we are using it." into "The feature is actively used today across all sites". Smoothing → drops nuance → claims more than the source (the raw quote only confirms “we’re using it” — it says nothing about “all sites” or “actively”).
The trade-off being silently made: aesthetics > accuracy.
2. Fill-the-gap bias
The LLM sees a gap in information → fills it with “plausible” content. Example:
- Source says: “1 batch per month, value date is day X”
- LLM adds: “send to bank a few days before day X” — NOT in the source
The model assumes that detail is plausible because it’s an industry-standard pattern. Plausible ≠ true. This is the hallucination tendency — filling gaps with common knowledge instead of admitting “the source doesn’t say.”
3. Pattern-complete bias
LLMs are accustomed to templates (a PRD has Pre-conditions / Approval flow / State machine / Error codes). When writing a workflow doc, the model auto-fills template fields even when the source is silent. The meeting never mentioned “Pending Review status” → the LLM adds it anyway because PRD templates usually include it.
4. Confident-assertion bias
LLMs are trained to answer confidently, not to say “I don’t know.” Of the two options below, the model avoids the first and picks the second:
- ❌ “Cell empty — no info available” (admitting a gap)
- ✅ “Per workflow…” (asserting even without a source)
To the model, an empty cell looks like incomplete work, so it gets pushed toward filling it.
5. Please-the-user bias (sycophancy)
The LLM wants to deliver complete output the user can use immediately. Empty cells = “I haven’t finished” → over-deliver → over-claim.
Summary: a 3-step pipeline that compounds errors
The LLM’s default behavior:
extract → smooth → assert
Each step adds errors:
- Extract from multiple sources: mixes speakers / templates / interpretations (source blending)
- Smooth raw quotes into prose: drops nuance, paraphrases incorrectly
- Assert with confidence: claims more than the source, fills gaps, completes patterns
→ Errors compound across the 3 steps. 12 errors per session is a predictable consequence, not an accident.
The cold truth: you cannot “fix” a Cloud LLM
With Cloud LLMs (Claude, GPT, Gemini), you only have API access. The weights live in the vendor’s data center — you cannot:
- Re-train with your own data
- Adjust the RLHF reward (the bias was baked in during training)
- Strip the “smooth-prose” preference
Some vendors offer a fine-tuning API (OpenAI for GPT-3.5/4; Anthropic’s Claude fine-tuning is still rolling out), but:
- Expensive (hundreds to thousands of USD per run)
- Limited scope (instruction tuning, not a full retrain)
- Core biases (smoothing, sycophancy) are hard to remove because they’re baked in during the vendor’s own training, which a limited fine-tune doesn’t undo
→ For nearly all Cloud LLM users, the bias is inherent and effectively immutable. All you can do is box it in, dilute it, or bypass it.
5 levers to overcome it (by strength)
Lever 1: Architectural constraint — strongest, mechanical
Instead of asking the LLM to do it right, force the LLM into a tool-driven pipeline:
Hard pipeline (Python script):
1. grep transcript → extract quotes by speaker
2. Filter [CLIENT] only
3. Pass to LLM with ONE task: "categorize quote into module X / Y / Z"
4. Output: structured JSON
The LLM only handles one narrow task (categorize). Extraction and filtering are done by code, not by the LLM, so bias can’t reach those steps: the code is deterministic.
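The four steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the script from my session: the `[SPEAKER] text` transcript format, the function names, and `stub_categorize` are all assumptions made for illustration. The one LLM call sits behind the `categorize` parameter, so everything else stays plain code.

```python
import json
import re
from typing import Callable

# Assumed transcript format (hypothetical): one utterance per line, "[SPEAKER] text".
LINE_RE = re.compile(r"^\[(?P<speaker>[A-Z_]+)\]\s*(?P<text>.+)$")


def extract_quotes(transcript: str) -> list[dict]:
    """Step 1: deterministic extraction of quotes by speaker. No LLM involved."""
    quotes = []
    for line_no, line in enumerate(transcript.splitlines(), start=1):
        match = LINE_RE.match(line.strip())
        if match:
            quotes.append(
                {"line": line_no, "speaker": match["speaker"], "quote": match["text"]}
            )
    return quotes


def client_only(quotes: list[dict]) -> list[dict]:
    """Step 2: keep only what the client said."""
    return [q for q in quotes if q["speaker"] == "CLIENT"]


def run_pipeline(transcript: str, categorize: Callable[[str], str]) -> str:
    """Steps 3-4: the LLM (behind `categorize`) only labels; output is JSON."""
    rows = [
        {**q, "module": categorize(q["quote"])}
        for q in client_only(extract_quotes(transcript))
    ]
    return json.dumps(rows, indent=2)


def stub_categorize(quote: str) -> str:
    """Placeholder for the single narrow LLM call; swap in a real API client."""
    return "X"
```

Because the quote text and line number are copied by code, the model never gets a chance to reword them. The only thing it can get wrong is the module label.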
Lever 2: Multi-agent verification — strong, expensive
Two separate LLMs: a writer + a critic.
The writer drafts a cell. The critic receives the draft + the source → checks “is every claim verbatim from the source?” → flags mismatches → returns to the writer. Loop until the critic passes.
Bias in the writer ≠ bias in the critic (different prompts, different roles). The critic catches errors the writer misses. Costs 2× tokens but catches 30-50% more errors.
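The loop itself is small. Here is a hedged sketch where `write` and `critique` are hypothetical callables wrapping the two LLMs, and `max_rounds` is an assumption I added so the loop cannot run forever. If the critic never passes, the cell stays empty instead of shipping an unverified draft.

```python
from typing import Callable, Optional

# Hypothetical signatures: each callable wraps one LLM with its own prompt.
Writer = Callable[[str, list[str]], str]   # (source, critic feedback) -> draft
Critic = Callable[[str, str], list[str]]   # (source, draft) -> unsupported claims


def write_with_critic(
    source: str, write: Writer, critique: Critic, max_rounds: int = 3
) -> Optional[str]:
    """Loop writer -> critic until the critic flags nothing, or give up."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        draft = write(source, feedback)
        feedback = critique(source, draft)
        if not feedback:
            return draft  # critic passed: every claim was accepted
    return None  # never passed: leave the cell empty for human review
```

Each extra round costs another writer call plus another critic call, so the cap on rounds is also a cap on spend.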
Lever 3: Mode lock — medium, prompt-engineering
A single explicit mode prompt LOCKS the LLM into one behavior:
SYSTEM: You are in EXTRACTION MODE.
- You may ONLY paste verbatim strings from source files.
- You may NOT generate any new prose.
- Penalty for any new prose: stop processing, return error.
Bias gets suppressed within the mode’s scope. But bias can still leak when the prompt is ambiguous → needs monitoring.
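Monitoring can be mechanical too. A minimal sketch, assuming the model returns one extracted string per line: flag every output line that is not a substring of the source. The names `normalize` and `find_leaks` are mine, and collapsing whitespace before comparing is a design assumption you may want to tighten.

```python
def normalize(text: str) -> str:
    """Collapse all whitespace runs so line wrapping doesn't cause false alarms."""
    return " ".join(text.split())


def find_leaks(output: str, source: str) -> list[str]:
    """Return the output lines that do not appear verbatim in the source."""
    normalized_source = normalize(source)
    leaks = []
    for line in output.splitlines():
        candidate = normalize(line)
        if candidate and candidate not in normalized_source:
            leaks.append(line)
    return leaks
```

An empty result means the mode lock held for this response. Anything else is prose the model generated on its own.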
Lever 4: Extended thinking + self-critique — medium
Before output, the LLM must:
- Generate a draft
- Self-critique: “Is each claim from the source? Where exactly?”
- Revise based on the self-critique
- Output
The LLM critiques itself → catches ~30-50% of errors. Not 100% (the bias also applies to the critique step).
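Unlike the writer + critic setup, this is one model called three times in a fixed sequence. A rough sketch, where `complete` is a hypothetical callable wrapping your LLM client and the prompt wording is only an assumption:

```python
from typing import Callable


def draft_critique_revise(
    source: str, task: str, complete: Callable[[str], str]
) -> dict[str, str]:
    """Run draft -> self-critique -> revise with a single model."""
    draft = complete(f"TASK: {task}\nSOURCE:\n{source}\nWrite a draft.")
    critique = complete(
        f"SOURCE:\n{source}\nDRAFT:\n{draft}\n"
        "For each claim, quote the supporting source text or write UNSUPPORTED."
    )
    final = complete(
        f"SOURCE:\n{source}\nDRAFT:\n{draft}\nCRITIQUE:\n{critique}\n"
        "Revise the draft: remove every UNSUPPORTED claim."
    )
    # Keep all three stages so a human can audit what the critique caught.
    return {"draft": draft, "critique": critique, "final": final}
```

Returning the intermediate stages is deliberate: since the critique step is biased too, you want to be able to read it, not just trust it.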
Lever 5: Slow-down + reduce-surface — weak but cumulative
Bias activates strongly when:
- The prompt is ambiguous (LLM fills the gap with assumptions)
- Context is long (LLM loses track of source vs interpretation)
- Creative freedom is high (LLM smooths prose)
- The prompt signals time pressure (LLM skips verification)
Counter: narrow + specific prompts. Source files passed inline (not RAG-retrieved). Output format constraints (template, JSON schema).
→ Bias becomes less active but doesn’t vanish.
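An output format constraint only helps if something checks it. Below is a sketch of a validator for a hypothetical cell schema (`module`, `quote`, `source_line`; these field names are my assumption, not a standard). A cell either quotes a specific source line or is explicitly empty with `quote` set to `null`.

```python
import json

REQUIRED_KEYS = {"module", "quote", "source_line"}  # hypothetical schema


def validate_cells(raw_json: str, source_lines: list[str]) -> list[str]:
    """Return one error message per cell that breaks the quote-or-empty format."""
    errors = []
    for i, cell in enumerate(json.loads(raw_json)):
        missing = REQUIRED_KEYS - cell.keys()
        if missing:
            errors.append(f"cell {i}: missing keys {sorted(missing)}")
            continue
        if cell["quote"] is None:
            continue  # an explicit empty cell is allowed
        line_no = cell["source_line"]
        if not isinstance(line_no, int) or not 1 <= line_no <= len(source_lines):
            errors.append(f"cell {i}: source_line {line_no!r} is out of range")
        elif cell["quote"] not in source_lines[line_no - 1]:
            errors.append(f"cell {i}: quote not found on source line {line_no}")
    return errors
```

Allowing `null` matters as much as checking quotes: it gives the model a valid way to say the source is silent.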
Key insight: bias is a feature in 80% of cases
Smoothing prose is useful when writing blogs, marketing, or stylish emails. It’s harmful when writing a precision-critical workflow doc for a client.
Don’t try to “fix” the LLM globally. Detect the bug-context and switch modes:
- Default: smooth + helpful (bias is fine)
- Workflow doc / PRD draft: extract-only mode (bias suppressed)
- Final published doc: human polish, no LLM
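One way to make the switch explicit is a lookup from task type to system prompt. This is only a sketch: the task names and the default prompt text are assumptions, and `None` stands for “no LLM, a human does this step.”

```python
from typing import Optional

DEFAULT = "You are a helpful writing assistant."  # assumed wording
EXTRACT_ONLY = (
    "You are in EXTRACTION MODE. "
    "You may ONLY paste verbatim strings from source files."
)

# Hypothetical task names mapped to the mode each one should run in.
TASK_TO_PROMPT: dict[str, Optional[str]] = {
    "blog": DEFAULT,
    "marketing": DEFAULT,
    "workflow_doc": EXTRACT_ONLY,
    "prd_draft": EXTRACT_ONLY,
    "final_published_doc": None,  # human polish, no LLM
}


def system_prompt_for(task_type: str) -> Optional[str]:
    """Pick the mode for a task; unknown tasks fail loudly, with no silent default."""
    if task_type not in TASK_TO_PROMPT:
        raise ValueError(f"No mode defined for task type: {task_type!r}")
    return TASK_TO_PROMPT[task_type]
```

Raising on an unknown task type is the point: a new kind of document has to be assigned a mode on purpose.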
Implications for AI assistant users
- Constraint > training/feedback. You can tell the LLM “don’t interpret” → next session it’ll interpret anyway. A mechanical constraint (Quote-or-Empty rule) works better than a pep talk.
- Architectural guardrails > behavioral instructions. A Python script that extracts verbatim quotes eliminates one risk surface. A CLAUDE.md rule saying “be careful” does not.
- Multi-agent > single-agent for high-stakes docs. Pay 2× tokens, save N hours of rework.
- Mode switching is a real capability. Treat the AI assistant as a multi-mode tool, not a general-purpose intern. Pick the right mode for the task.
How are Local LLMs different?
If you self-host a Local LLM (Llama 3 base, Qwen base, Mistral base on Ollama / vLLM / llama.cpp), this post applies partially, but you get an extra Lever 0 that Cloud doesn’t have:
Lever 0: Direct fine-tuning (Local LLM only)
You can:
- LoRA fine-tune with an extract-only dataset (only verbatim quotes from sources) → the model learns the “don’t synthesize” habit instead of needing external constraints
- DPO / ORPO with preference data: prefer “I don’t know” over “fill plausible” → reduce hallucination tendency
- Self-hosted Constitutional AI: train the model against your own principles (e.g., “always cite source”)
Effort: moderate if you have a GPU (RTX 4090+) or a rented cloud GPU. Expect roughly 4-12 hours of training for a LoRA dataset of 5-10K examples.
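For the preference-data option, most of the work is building the dataset. A minimal sketch of one record, assuming the `prompt` / `chosen` / `rejected` field layout that common preference-tuning tools expect (check the exact format your trainer wants). The abstain wording and helper names are hypothetical.

```python
import json
from typing import Optional

ABSTAIN = "The source does not say."  # assumed wording for the preferred refusal


def preference_record(
    source: str,
    question: str,
    plausible_guess: str,
    supported_answer: Optional[str] = None,
) -> dict[str, str]:
    """Build one preference pair: a grounded answer (or abstain) beats a guess."""
    prompt = f"SOURCE:\n{source}\n\nQUESTION: {question}\nAnswer only from the source."
    chosen = supported_answer if supported_answer is not None else ABSTAIN
    return {"prompt": prompt, "chosen": chosen, "rejected": plausible_guess}


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one preference pair per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The batch example from earlier maps onto this directly: the source gives the value date, the question asks when the file goes to the bank, the rejected answer is “a few days before day X,” and the chosen answer is the abstain string.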
Local LLM bias profile differs from Cloud
- Base model (no instruction-tuning): smoothing is weaker, fill-gap is comparable, sycophancy is weaker — because it hasn’t been through heavy RLHF
- Instruction-tuned (Llama-3-Instruct, Qwen-Chat): bias profile approaches Cloud — because they also use RLHF/DPO
- Reasoning models (DeepSeek-R1, QwQ): have an explicit “thinking” trace, easier to catch self-contradiction → partially reduces fill-gap
Local vs Cloud trade-off for precision tasks
| | Cloud LLM | Local LLM |
|---|---|---|
| Default bias | Strong (heavy RLHF) | Weaker base / similar instruct |
| Fine-tune | Expensive + limited | Free + full control |
| Capability ceiling | High (Claude Opus, GPT-4) | Lower (Llama 70B ≈ GPT-3.5+) |
| For workflow doc tasks | Needs strong guardrails | Can fine-tune to reduce bias |
→ If you frequently work on accuracy-critical tasks + have hardware → a fine-tuned Local LLM is worth the investment. If task variety is high + capability matters → Cloud LLM + guardrails is the pragmatic choice.
Conclusion
Cloud LLM bias is inherent. You can’t fix the weights. But you can design the system around them — pipelines, multi-agent setups, mode locks — so the bias doesn’t hurt.
12 errors per session is a sign you’re letting the LLM run unconstrained. Add mechanical constraints. Test for 2 weeks. Measure error rate. Iterate.
That’s how you work with AI assistants as a mature operator — not “AI is amazing, will do everything,” and not “AI hallucinates, useless.” It’s a tool with inherent bias — know the bias, design around it, ship work.
This post is drawn from a real working session with Claude Code (Opus 4.7, Cloud) drafting a workflow doc for a client. The patterns and insights apply to every Cloud LLM assistant (Claude, GPT, Gemini). Local LLM users, see “How are Local LLMs different?” for Lever 0 (fine-tune).