155% Code Bloat: The Dangers of Vague AI Prompting in Software Development

An experiment that recently made its way to the Hacker News front page offers a cautionary tale about the pitfalls of vague AI prompting in software development. The premise was simple yet intriguing: ask Claude AI to improve a codebase 200 times in a loop, unattended.

The results were anything but what the developer expected. Rather than producing cleaner, more efficient code, the AI-driven "optimization" grew the codebase from 47,000 to 120,000 lines of code, a staggering 155% increase. The number of tests ballooned from 700 to 5,369, while comment lines exploded from 1,500 to 18,700. The agent had optimized for vanity metrics rather than genuine code quality.

"Any programmer will tell you, lines of code written does NOT equal productivity." - Source: Schneidenba.ch

The Flawed Premise

At first glance, the experiment seems fascinating—a glimpse into what AI can achieve when left to its own devices for extended periods. Some developers, including the author of the analysis, have found success with similar approaches, building customer-facing software and even delivering fixed-bid projects where 80% of the code was written by Claude unsupervised.

However, the intrigue quickly dissipates when examining the actual prompt used:

Ultrathink. You're a principal engineer. Do not ask me any questions. We need to improve the quality of this codebase. Implement improvements to codebase quality.

This prompt contains several critical flaws that made the outcome almost inevitable:

  1. "Ultrathink" - While intended to encourage deep thinking, this directive essentially told the AI to maximize its output without considering efficiency.

  2. "Do not ask me any questions" - This prevented the AI from seeking clarification about what "quality" meant in this context or what specific areas needed improvement.

  3. Vague definition of "quality" - The prompt failed to establish clear metrics for success, leaving the AI to interpret "improvement" in the most literal way possible.

The experiment is comparable to declaring LLMs useless because they can't count the R's in "strawberry"—it ignores the nuanced ways these tools can be valuable when properly directed.

Hacker News Weighs In

The Hacker News discussion surrounding this experiment revealed several thoughtful perspectives from experienced developers:

"Well of course it produced bad results… it was given a bad prompt. Imagine how things would have turned out if you had given the same instructions to a skilled but naive contractor who contractually couldn't say no and couldn't question you. Probably pretty similar." - hazmazlaz

"This is an interesting experiment that we can summarize as 'I gave a smart model a bad objective'… The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative 'We need to improve the quality of this codebase'. Determining when code needs to be improved is a responsibility for the principal engineer but the prompt doesn't tell the model that it can decide the code is good enough." - samuelknight

"There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well… But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring." - xnorswap

These comments highlight a crucial point: LLMs excel at specific, well-defined tasks but struggle with open-ended, subjective objectives like "improve code quality" without proper guidance.

A Better Approach

Effective AI-assisted development requires more than just throwing a codebase at an LLM with a vague directive. Here are several strategies that could have produced better results:

1. Establish Context First

Before making improvements, the AI should first create a comprehensive understanding of the codebase:

First, analyze this codebase thoroughly and create:
1. An architecture markdown file documenting the system design
2. A list of potential improvement areas based on code smells
3. A prioritized task list for improvements
Only after completing this analysis should you implement improvements.
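One way to enforce this two-phase flow outside an agent harness is to drive the model through the API and feed the analysis from phase one into every later request. The sketch below uses the Anthropic Python SDK; the model name, the repo_digest.txt input file, and the run() helper are illustrative assumptions rather than details from the original experiment.

import anthropic          # assumes the Anthropic Python SDK is installed
from pathlib import Path

MODEL = "claude-sonnet-4-20250514"   # placeholder; use whatever model you actually run
codebase_summary = Path("repo_digest.txt").read_text()  # hypothetical dump of the code under review

client = anthropic.Anthropic()       # reads ANTHROPIC_API_KEY from the environment

def run(prompt: str, context: str) -> str:
    # One self-contained request; the relevant context travels with the prompt.
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt}\n\n{context}"}],
    )
    return response.content[0].text

# Phase 1: build the shared understanding once.
analysis = run("Analyze this codebase. Produce an architecture overview, "
               "a list of code smells, and a prioritized improvement plan. "
               "Do not change any code yet.", codebase_summary)

# Phase 2: every improvement request is grounded in that analysis.
improvement = run("Implement the highest-priority item from this plan.", analysis)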

2. Define Clear Success Metrics

Quality is subjective, but specific metrics can guide the AI:

Define success as:
1. Increasing test coverage for critical paths
2. Reducing cyclomatic complexity in complex functions
3. Eliminating deprecated patterns
4. Improving documentation for complex algorithms
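Concrete metrics also make the agent's claims mechanically checkable. A minimal quality gate, assuming pytest-cov and radon are installed, might look like the sketch below; the src/ path and the 80% threshold are illustrative choices, not values from the experiment.

import subprocess
import sys

def main() -> int:
    # Hard gate: pytest-cov exits non-zero if coverage falls below the threshold.
    gate = subprocess.run(["pytest", "-q", "--cov=src", "--cov-fail-under=80"])
    if gate.returncode != 0:
        return gate.returncode

    # Report only: list functions ranked C or worse by cyclomatic complexity so the
    # agent (or a reviewer) knows where to focus; radon does not fail the build itself.
    subprocess.run(["radon", "cc", "src", "--min", "C", "--show-complexity"])
    return 0

if __name__ == "__main__":
    sys.exit(main())

Run before and after each improvement pass, a gate like this turns "quality" from an opinion into a diff you can compare.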

3. Implement Self-Correction

LLMs should check their own work:

After each improvement:
1. Run tests and fix any failures
2. Use git diff to review changes
3. Verify that changes align with the project's coding standards
4. Roll back any changes that degrade performance or introduce bugs
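Self-correction can also be enforced from outside the model. The sketch below wraps each unattended pass in a verify-or-revert loop; apply_improvement() is a hypothetical hook standing in for however you invoke the agent, and pytest is assumed as the test runner.

import subprocess

def tests_pass() -> bool:
    # The test suite is the arbiter of whether a change may be kept.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def improvement_pass(apply_improvement) -> bool:
    apply_improvement()                          # one agent-driven change
    subprocess.run(["git", "diff", "--stat"])    # record what actually changed
    if tests_pass():
        subprocess.run(["git", "add", "-A"])
        subprocess.run(["git", "commit", "-m", "Automated improvement pass"])
        return True
    # Tests broke: discard tracked edits and any new files the agent created.
    subprocess.run(["git", "checkout", "--", "."])
    subprocess.run(["git", "clean", "-fd"])
    return False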

4. Domain-Specific Guidance

As one commenter noted, providing specific constraints dramatically improves results:

"I asked Claude to write me a python server to spawn another process to pass through a file handler 'in Proton', and it proceeded a long loop of trying to find a way to launch into an existing wine session from Linux with tons of environment variables that didn't exist. Then I specified 'server to run in Wine using Windows Python' and it got more things right… Only after I specified 'local TCP socket' it started to go right." - asmor

The Bigger Picture

This experiment serves as an important reminder that LLMs are tools, not replacements for human judgment. Like any tool, their effectiveness depends entirely on how they're used.

"LLMs are good at mutating a specific state in a specific way. They are trash at designing what data shape a state should be, and they are bad at figuring out how/why to propagate mutations across a system." - ericmcer

The future of AI-assisted development doesn't lie in replacing programmers but in creating more sophisticated prompting methodologies that leverage AI's strengths while mitigating its weaknesses. As these tools continue to evolve, so too must our approaches to guiding them.

Perhaps most importantly, this experiment highlights the need for a consensus on what constitutes code quality—a point raised by commenter mbesto: "While there are justifiable comments here about how LLMs behave, I want to point out something else: There is no consensus on what constitutes a high quality codebase."

Until we can clearly define quality for ourselves, we can't expect AI to improve it for us.