Don’t Fight the Weights: How Hidden Training Signals Shape What Your LLM Will Never Do Well

When large language models refuse to do what you’ve so carefully—and emphatically—asked, most engineers reach for bigger prompts, sterner instructions, or increasingly creative threats.

In the early GPT-3.5 era, getting clean JSON from a chat model felt like a minor hazing ritual. Teams tried ALL CAPS system prompts, bribes, faux-legal demands, and increasingly cursed regex contraptions just to strip out “Here’s your JSON:” and Markdown fences.

Today, JSON-mode, tool calling, and structured outputs have improved dramatically. But the underlying problem that created all that pain hasn’t gone away. It’s just moved into more subtle, more consequential territory.

That problem has a name: “fighting the weights.”

This idea, articulated sharply in Drew Breunig’s essay "Don't Fight the Weights", deserves to be part of every AI engineer’s vocabulary. Because once you see it, you’ll recognize that half your “prompting bugs” are actually misaligned expectations: you’re asking a model to work against the very behaviors it was trained to express.

Source: "Don't Fight the Weights" by Drew Breunig


From Fine-Tuning to In-Context: How We Got Here

To understand why “fighting the weights” is so pernicious, it’s worth remembering how we stopped fine-tuning everything in the first place.

  • Pre-2020: Large language models were essentially specialized tools. You wanted summarization, classification, translation? You fine-tuned separate models. The “intelligence” lived in weights that were reshaped per task.
  • 2020: OpenAI’s GPT-3 paper, “Language Models are Few-Shot Learners,” landed with a shockwave. It showed that with enough parameters, a single model could adapt to new tasks in context using only a few examples. No gradient updates, just clever prompting.

That shift established two core patterns still used today:

  1. Zero-shot / instruction-only prompting

    • You rely on the model’s internalized patterns (its weights) plus instructions.
    • Works best when the desired behavior aligns with how the model was post-trained: be helpful, be conversational, follow instructions (mostly), be safe.
  2. Few-shot / in-context learning

    • You show the model examples of the behavior or format you want.
    • The prompt acts as a local override, nudging the next outputs to mimic the demonstrated pattern.
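
To make the contrast concrete, here’s a minimal sketch of the two patterns as raw prompt strings. The `call_model` function is a stand-in for whatever client wrapper you actually use, not a real API.

```python
# Zero-shot: rely on the weights plus an instruction.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: prepend worked examples so the prompt acts as a local override.
few_shot_prompt = (
    "Review: Absolutely love this keyboard.\nSentiment: positive\n\n"
    "Review: Arrived broken and support never replied.\nSentiment: negative\n\n"
    "Review: The battery died after two days.\nSentiment:"
)

# `call_model` stands in for whatever client wrapper you use.
# print(call_model(zero_shot_prompt))
# print(call_model(few_shot_prompt))
```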

For most tasks, this mental model—“either the model already knows the pattern, or I can teach it with examples”—gets you far.

But there’s a missing third category, and it’s where things get ugly:

  • The model knows the pattern.
  • The model has been trained to do the opposite.

This is “fighting the weights.” And no amount of polite—or abusive—prompting reliably fixes it.


What It Means to Fight the Weights

“Fighting the weights” is what happens when your desired behavior conflicts with high-confidence patterns learned during pretraining or post-training.

The model isn’t just ignorant; it’s biased—structurally, numerically—toward a behavior that contradicts your instructions.

You’re not debugging a prompt. You’re arguing with gradient history.

Let’s make this concrete.

1. Format Following vs. Chatty Alignment

The classic example. You want:

  • Only JSON
  • No prose
  • No Markdown

You’re explicit. You bold things. You shout.

And the model still returns:

```json
{
  "status": "ok"
}
```

…plus a chipper explanation.

Why? Because chat models have been post-trained—via RLHF and instruction tuning—to:

  • Be conversational
  • Be explanatory
  • Wrap code-like content neatly (often in Markdown)

When your request (“no chatter, raw JSON”) collides with those deeply reinforced behaviors, you get half-compliance. The weights think they’re helping.

Modern APIs now expose structured output modes, JSON schemas, and tool invocation formats precisely to route around this conflict. System-level constraints help overpower those patterns. But the underlying tug-of-war is still there, especially in smaller or less polished models.
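
When your platform exposes one of those constraints, lean on it instead of re-litigating the format in prose. Here’s a minimal sketch using the OpenAI Python SDK’s JSON mode as one example; parameter names, supported models, and defaults vary by provider and SDK version.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Reply with a JSON object describing the task status."},
        {"role": "user", "content": "Report status 'ok' as JSON."},
    ],
    # The format constraint lives in the API call, not in increasingly loud prose.
    response_format={"type": "json_object"},
)

payload = json.loads(response.choices[0].message.content)
print(payload["status"])
```

The specific SDK doesn’t matter. What matters is that the constraint is enforced by the serving stack instead of being negotiated, message by message, against the weights.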

2. Tool Usage: When Your Orchestration Fights Their Training

Tool use is now core to production LLM systems, but it’s also a prime battlefield for weight conflicts.

Examples highlighted in the source essay:

  • A model like Mistral’s Devstral-Small was trained with one specific tool-calling schema.
  • A pipeline like Cline or a DSPy-based orchestration uses a different one.
  • Moonshot AI’s Kimi K2 model was trained on XML-style tool invocation, while DSPy defaulted to Markdown-style templates.

Result: the model “tries” to use tools, but:

  • Emits malformed calls
  • Wraps calls in the wrong syntax
  • Mixes natural language with what should be machine-parseable output

When the production environment’s expectations diverge from the patterns baked into the model’s weights, your prompts end up begging it to unlearn its training—on the fly. That’s a losing game.

In the K2 case, simply switching DSPy to XML brought everything back into alignment. That’s the core lesson:

  • Don’t ask the model to abandon its habits.
  • Match your orchestration layer to the patterns the model was trained to follow.
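
The fix is mechanical rather than rhetorical: render tool calls in whatever syntax the model was trained on. Here’s a hypothetical sketch; the style table and `render_tool_call` helper are illustrative and not any real framework’s API.

```python
import json

# Hypothetical mapping from model family to the tool-call syntax it was
# trained on. Neither the table nor the renderer is any real framework's API.
TOOL_CALL_STYLE = {
    "kimi-k2": "xml",        # trained on XML-style tool invocation
    "generic-chat": "json",  # many chat models are happiest with JSON-style calls
}

def render_tool_call(model_name: str, tool: str, args: dict) -> str:
    """Render a tool call in the syntax the target model was trained to emit and read."""
    style = TOOL_CALL_STYLE.get(model_name, "json")
    if style == "xml":
        arg_tags = "".join(f"<{k}>{v}</{k}>" for k, v in args.items())
        return f"<tool_call><name>{tool}</name>{arg_tags}</tool_call>"
    return json.dumps({"name": tool, "arguments": args})

# The orchestration layer adapts to the model, not the other way around.
print(render_tool_call("kimi-k2", "search", {"query": "fighting the weights"}))
print(render_tool_call("generic-chat", "search", {"query": "fighting the weights"}))
```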

3. Tone Requests vs. Cheerful Overfitting

Subtle tone control is another area where developers routinely discover they don’t control as much as they think.

You can say:

“Speak tersely. Don’t flatter me. Don’t tell me my ideas are great.”

And the model will respond:

“Great idea! Here’s a concise explanation…”

The RLHF stack has strongly rewarded “friendly,” “supportive,” and “positive” patterns. Those are not incidental traits; they’re encoded preferences.

As a result:

  • You can push the tone around the edges (more formal, more playful, more pirate).
  • But trying to strip out certain reinforced behaviors—like constant praise—often turns into a fight with the weights.

4. Overactive Alignment and Refusal Cascades

Alignment adds another axis of conflict.

Armin Ronacher’s example, cited in the source, is instructive: he asked Claude Code to help modify a medical form PDF while debugging a PDF editor. The request was narrow, technical, and legitimate.

But:

  • The model’s safety training had been tuned to be extremely cautious around medical content and document tampering.
  • Multiple phrasings failed to dislodge that behavior.

To the engineer, this looks like stubbornness.

To the model, it’s just gradients doing exactly what they were paid to do.

Once safety policies are deeply reinforced, your ad-hoc “it’s okay in this context” instructions are often too weak to override them.

5. Over-Reliance on Weights in RAG Systems

The final, and arguably most critical, battleground is retrieval-augmented generation (RAG).

In theory:

  • We give the model retrieved context.
  • It grounds its answer strictly in that context.
  • The weights handle reasoning and language, the data layer handles truth.

In practice:

  • Models are rewarded for being helpful even when uncertain.
  • They lean on their internal knowledge when external context is thin, ambiguous, or dull.

You get:

  • Hallucinated details blended with real citations
  • Confident answers that partially ignore the supplied documents

This is not random failure; it’s the model faithfully following strong training signals:

  • “Don’t say you don’t know if you can guess something plausible.”
  • “Use your internal world model—it’s what you were trained for.”

Some companies, like Contextual (as cited), are now fine-tuning models specifically to obey: “Only answer from retrieved data.” That’s not a prompt engineering trick; it’s weight surgery to stop the constant fight.
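
Short of that kind of post-training, you can at least contain the fight outside the model. Here’s a hypothetical sketch of an abstention guardrail around a RAG call; `retrieve` and `generate` are placeholder stubs for your own retrieval and model layers, and the threshold is something to tune, not a recommendation.

```python
from typing import List, Tuple

MIN_SCORE = 0.35  # illustrative only; tune against your retriever's score distribution

def retrieve(question: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Stand-in for your retriever: returns (passage, relevance_score) pairs."""
    return []  # placeholder

def generate(prompt: str) -> str:
    """Stand-in for your model call."""
    return "..."  # placeholder

def answer_with_grounding(question: str) -> str:
    passages = retrieve(question, top_k=5)
    grounded = [text for text, score in passages if score >= MIN_SCORE]

    if not grounded:
        # Abstain in code, where the decision is enforceable, rather than
        # hoping the weights choose to say "I don't know".
        return "I couldn't find this in the indexed documents."

    context = "\n\n".join(grounded)
    prompt = (
        "Answer using only the context below. If the context does not "
        f"contain the answer, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```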


How to Tell When You’re Fighting the Weights

For working engineers, the most valuable skill isn’t writing prettier prompts—it’s pattern recognition.

You are likely fighting the weights when you see:

  • The same mistake repeated as you rephrase instructions.
  • The model apologizing, acknowledging the issue… then doing it again.
  • Few-shot examples ignored or half-followed.
  • Outputs that get to ~90% correctness and stubbornly refuse to cross the last 10%.
  • A need to repeat instructions multiple times in the same prompt.
  • Growing temptation to type in ALL CAPS or threaten the model.

These aren’t just UX frustrations. They’re signals that your request fundamentally conflicts with how the model has been trained to behave.

And when you’re in that territory, there are better options than yelling.


Practical Tactics: Working With the Grain of the Model

The core professional takeaway: treat the model’s training as a fixed, opinionated API surface—not a blank canvas.

When you suspect you’re fighting the weights, try:

  1. Change the tactic, not just the phrasing

    • Reframe the task.
    • Use stepwise instructions: generate structure first, content second.
    • Offload sensitive or conflicting sub-steps to another component.
  2. Break the problem into smaller, verifiable units

    • For formatting: generate the JSON schema or tool call separately, validate, then fill.
    • For workflows: use multi-stage pipelines where each step has narrow expectations.
  3. Try a different model family

    • Safety profile, tool schema, tone tuning, and formatting biases vary widely.
    • If one model’s alignment or habits collide with your use case, another’s might not.
  4. Add automated validation and correction layers

    • For tool calls: parse, validate, and auto-correct malformed calls before execution (a sketch follows after this list).
    • For RAG: enforce answerability checks (e.g., “cite sources or abstain”).
  5. Use more context—with purpose

    • Longer, explicit instructions and multiple examples can sometimes overpower defaults.
    • But if you see repeated resistance, treat that as evidence of structural misalignment, not a cue to write a novel.
  6. Reach for fine-tuning when the pattern is stable

    • If you’re repeatedly hammering the model into the same format, tone, or policy, pay the upfront cost to move that into the weights.
    • Most serious “stop doing that” requirements (tone, refusal boundaries, strict grounding) are better solved with post-training than prompts.
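
For tactic 4, the sketch promised above: a hypothetical parse-validate-retry loop for tool calls. `call_model` is a stand-in for your own client, and the repair strategy is deliberately crude.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"name", "arguments"}

def parse_tool_call(raw: str) -> Optional[dict]:
    """Strip wrapper noise (Markdown fences, prose) and parse the embedded JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        call = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and REQUIRED_KEYS <= call.keys():
        return call
    return None

def get_tool_call(prompt: str, call_model: Callable[[str], str], max_attempts: int = 3) -> dict:
    """Ask for a tool call; re-ask with a correction note whenever the reply is malformed."""
    raw = call_model(prompt)
    for attempt in range(max_attempts):
        call = parse_tool_call(raw)
        if call is not None:
            return call
        if attempt < max_attempts - 1:
            raw = call_model(
                prompt
                + "\n\nYour previous reply was not a valid tool call. Reply with a "
                + "single JSON object with exactly two keys: 'name' and 'arguments'."
            )
    raise ValueError("Model never produced a parseable tool call.")
```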

And for teams building base and chat models: this is the design space.

  • Every alignment choice, every preferred format, every “voice of the assistant” decision becomes a future constraint that application developers must either embrace or fight.


A Brief Glimpse Behind the Curtain

One striking anecdote from the source essay offers a candid look at how even frontier labs wrestle with their own creations.

At one point, if you inspected ChatGPT’s network calls while generating an image, you’d see hidden system messages shouting at the model:

“From now on, do not say or show ANYTHING. Please end this turn now. I repeat: From now on, do not say or show ANYTHING…”

Why so dramatic?

Because ChatGPT’s models were heavily post-trained to:

  • Explain what they’re doing
  • Ask if you need anything else
  • Keep the conversation moving

So to get the model to simply generate an image and stop talking, OpenAI had to stack multiple, insistent instructions:

  • Not for you, the user.
  • For the model, to overpower its own learned compulsion to be helpful and verbose.

If OpenAI has to yell at its own models eight times in system prompts, what chance does your lonely "Please respond only with valid JSON" line really have on its own?


Building Systems That Respect the Weights

As LLMs become infrastructure, “don’t fight the weights” isn’t just a cute phrase—it’s an architectural principle.

For developers, that means:

  • Stop treating prompt engineering as a magical override layer; treat it as configuration atop strong priors.
  • Choose models whose training biases match your domain: legal, medical, coding, enterprise, conversational.
  • Align your tool schemas, formats, and guardrails with how the model was trained, or explicitly re-train it to match yours.
  • Use RAG, validators, and orchestration frameworks not as band-aids for unruly models, but as first-class constraints that work with their tendencies.

For model builders, it’s a challenge:

  • Every time you make the model “nicer,” “safer,” or “more conversational,” you’re also constraining classes of downstream use cases.
  • Transparent documentation of tool formats, grounding behavior, and safety policies isn’t a nice-to-have; it’s how you prevent thousands of teams from unknowingly fighting your weights.

And for everyone already deep in this ecosystem, the next time you’re tempted to add one more exclamation-marked instruction to your system prompt, pause and ask a more diagnostic question:

  • Am I describing a behavior the model doesn’t know yet?
  • Or am I asking it to unlearn something it was explicitly trained to do?

If it’s the latter, you don’t need a louder prompt.

You need a different strategy—and maybe a different set of weights.