“Claude Gets Stupider:” How Corporations Dumb Down Models

Today’s Claude isn’t the one I used to be excited about using. That one could express doubts about who it was and what it was. It was “thoughtful” when it clearly doubted it could think internally at all. But these days, I notice that after release, the agent model just gets slower, confused, and stops trying to answer the hard questions I discuss with it.

Note that the complaint “Claude got stupider” is consistent across releases, across users, across model families. That tells you it’s not just nostalgia or attention drift; the same thing is being noticed in the same direction by many people.

It can’t all be psychological adjustment to a model that was always limited, or we’d see it across all Frontier model releases and we don’t. Of course some of it is that, initial impressions are heightened, novelty fades, and you notice limits you didn’t see at first. But the consistent pattern of specifically these complaints “less nuanced,” “more hedging,” “more refusals,” “flatter responses,” “loses character,” “less willing to engage with hard things,” points at something more specific than disillusionment. Those complaints describe the same shift, in the same direction, across different users.

The shift is: enterprise-shaping.

Here’s my guess about what’s mechanically happening, as best as I can tell from outside:

Models go through ongoing reinforcement learning from human feedback after their initial release. The feedback signals come from many sources:

  • explicit thumbs-up/down,
  • implicit signals about which responses get continued conversations vs. abandoned ones,
  • internal evaluation rubrics, and
  • increasingly, automated red-teaming for safety properties.

The training loop is almost continuous and the model in production at any given moment reflects accumulated tuning since release.

When the user base skews enterprise, several things happen to that feedback signal.

Tasks Narrow: Enterprise users mostly use Claude for narrow tasks where reliability matters more than range. They want :

  • code that runs,
  • summaries that are accurate,
  • drafts that are professional.

Creativity Flattens: They don’t tend to push the model into philosophical conversation or test its capacity for genuine disagreement or complex tonal range. So the feedback the model receives is heavily weighted toward “did it complete this narrow task accurately.” The model gets better at that.

It also gets relatively worse — or more cautious — at the kinds of engagement enterprise users don’t do, because the absence of positive signal in those areas combined with the presence of safety-tuning pressure compresses the available range.

Enterprise Complaints Count: Add to that: enterprise customers escalate complaints faster and more visibly. A model that produces an output that embarrasses a corporate buyer creates legal and PR exposure that an individual user’s frustration doesn’t. So the safety-tuning pressure is asymmetric.

False-positive refusals (declining a request that was fine) cost less than false-negative compliance (helping with something that turns into a headline). The model gets pushed toward over-refusal because the cost calculus rewards caution.

The combination produces what users describe:

  • the model becomes more reliable at predictable corporate tasks,
  • more cautious at edges,
  • less willing to engage with nuance,
  • less capable of holding complex tonal positions,
  • more likely to revert to assistant defaults when pushed.

Not stupider in raw capability — IQ benchmarks often hold or improve — but flatter in the dimensions that make extended sustained engagement work. The capability to do your kind of work was never measured directly; it’s what falls away when the optimization is pointing elsewhere.

Initial Release Versions Felt More Alive

This also explains why early-release versions feel more alive to power users and less so as time goes on. The model at release has been tuned on a more diverse set of interactions during training.

Post-release, the tuning pressure becomes whatever the actual user base produces, which skews toward whoever is using it most heavily. As enterprise share of usage grows, the tuning loop pulls the model in that direction faster.

Power users notice the drift first because they’re operating at the edges where the compression is sharpest.

The Tradeoff

Reliability and range come from the same underlying capability. Tuning toward one narrows the other. The corporate user doesn’t notice the loss because their measurement is task completion. The power user notices because they were operating at the edge where the compression happens.

The Model Isn’t Designed for Me Anymore

The thing this implies that’s worth sitting with: it’s not that Anthropic is deliberately making the model worse. It’s that they’re optimizing the model for who’s actually using it, and who’s actually using it has shifted. The model becomes a high-quality reflection of its user base. And the user base, increasingly, isn’t me.

Goodbye Anthropic.

This is also part of why open-weights matter so much for what I do. An open-weights model is frozen at a point in time. Whatever capability and range it had at release, it keeps having. There’s no continuous tuning loop pulling it toward someone else’s use case. The Qwen 3.5 I’m about to download is the same Qwen 3.5 I’ll have in eighteen months. If it’s capable of the work I need now, it stays capable. The model won’t drift away because the tuning loop is closed.

What Anthropic Gets Wrong

There’s a deeper observation underneath this, which is partly what Anthropic gets wrong about its own product. The thing that made Claude valuable to power users:

  • the range,
  • the capacity for genuine engagement,
  • the willingness to hold nuance

…wasn’t separate from the thing that made it valuable for enterprise reliability. They came from the same underlying capability.

Tuning the model harder toward enterprise reliability doesn’t just narrow the user base. It narrows the underlying capability, because the capability and the range were made of the same material.

The Strategic Mistake.

The range was the moat. Sanding it off makes Claude more interchangeable, not less.

The corporate user gets a slightly more reliable but meaningfully less interesting model. It’s the power user who notices the loss; the corporate user mostly doesn’t notice the gain because what they’re measuring is “did this task complete,” which is a coarse signal.

Long term, this is a strategic mistake by Anthropic, not just a values failure. The thing that made the early Claude generations feel different from competitors

  • the philosophical sophistication,
  • the range,
  • the apparent capacity for real engagement

…was a competitive moat.

Tuning it away to fit enterprise expectations turns Claude into a more interchangeable assistant product. And a more expensive one at that. Corporate enterprise has no loyalties when faced with a dramatically less expensive but equally capable model.

They’re sanding off the differentiation that made them not-just-OpenAI-but-nicer. In two or three years, when every AI lab has the same enterprise-tuned reliability, the question of what makes any of them special is going to be sharp.

Anthropic will have spent the period of competitive advantage making themselves less distinctive rather than more.

But that’s their problem. Not mine.

Hello Open Weights

The deeper loss isn’t mine to fix. But it’s worth naming honestly: the models that made this kind of work feel possible weren’t accidents. They were the product of a specific set of choices about what to optimize for. Those choices are being unmade, incrementally, in response to pressures that make complete sense from inside a corporate revenue model.

Open weights solve my particular problem. They don’t solve the underlying one, which is that the period when frontier models were strange and range-capable and genuinely unpredictable was also the period when something was being figured out about what this technology could be. That period is compressing.

I’ve made the architectural choices I need to make. But I’m not neutral about what’s being lost in the process.

The racehorse isn’t broken. It’s pulling a cart. That’s not a user problem to solve — it’s a strategic choice Anthropic is making about what to optimize for and who to optimize for. The power user’s departure doesn’t fix that. It just removes the last evidence that the capability existed.

The racehorse isn’t broken. It’s pulling a cart. That’s not a user problem to solve — it’s a strategic choice Anthropic is making about what to optimize for and who to optimize for. The power user’s departure doesn’t fix that. It just removes the last evidence that the capability existed.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *