Code Smells for AI Agents: Q&A with Eno Reyes of Factory
#AI

DevOps Reporter
8 min read

Factory's CTO Eno Reyes discusses how their coding agent tackles the quality problem in AI-generated code, emphasizing the importance of harness engineering and signal-based validation.


AI agents remain the thing everyone wants to talk about. But for every agent that claims to write code fast and solve your problems, we hear about mountains of slop code that has to be fixed by hand. Quality software still needs high-quality code, and Factory aims to give you a coding agent that builds quality signals into the process.

I caught up with Eno Reyes, co-founder and CTO of Factory, at re:Invent last December to talk about how their frontier coding agent gives the major players a run for their money.


Q: Tell us a little bit about who you are and what Factory does.

Eno Reyes: I'm the co-founder and CTO of Factory. At Factory, we're building a platform that helps large engineering organizations build software fully autonomously. Very concretely, we give them not only a frontier coding agent that lets them do any task across the software development lifecycle, but also tooling to analyze the quality of their codebases and the impact these agents have, in order to maximize the success of the rollout.

Q: There are a lot of big players in the coding agent space. Why build another coding agent?

ER: We've been working on this for around two and a half, almost three years. A lot of our researchers come from teams that built LLMs or built agents in prior lives. To build a good agent, you have to have one that's model agnostic. It needs to be deployable in any environment, any OS, any IDE. A lot of the tools out there force you into a hard trade-off that we felt wasn't necessary: you either have to lock yourself in to one LLM vendor or ask everyone at your company to switch IDEs. To build a truly model-agnostic, vendor-agnostic coding agent, you put in a lot of time and effort to figure out all the harness engineering necessary to make that succeed, which we think is a fairly different skillset from building models. And that's why we think companies like us are actually able to build agents that outperform on most evaluations from our lab.

Q: What does harness engineering entail for something that's connecting to any given IDE, terminal, whatever? How do you make sure that it just works?

ER: It's a hard problem because there are several dimensions to it. You have to manage the context. All LLMs have context limits, so you have to manage that as the agent progresses through tasks that may take as long as eight to ten hours of continuous work. There are things like how you choose to instruct the model or inject environment information, and how you handle tool calls. The sum of all of these things requires attention to detail. There really is no individual secret, which is also why we think companies like us can actually do this. It's the sum of hundreds of little optimizations. The industrial process of building these harnesses is what we think is interesting and differentiated. Our team has built out ways to know what a good harness looks like and what a bad outcome looks like. By bringing in that information about what good and bad harnesses generally look like, you can systematize the way you improve. Because we also have coding agents, we can automatically upgrade and optimize the agents that we build.
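
To make the context-management point concrete, here is a minimal sketch of one way a harness might elide old tool output as a long-running task approaches the model's context limit. This is an illustration, not Factory's implementation; the message shape and token counting are placeholder assumptions.

```python
# Minimal sketch of context-window management in an agent harness.
# The message shape and token counting are illustrative placeholders,
# not Factory's actual implementation.

from dataclasses import dataclass

@dataclass
class Message:
    role: str             # "system", "user", "assistant", or "tool"
    content: str
    pinned: bool = False  # system prompt / environment info is never dropped

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return max(1, len(text) // 4)

def trim_context(history: list[Message], limit: int) -> list[Message]:
    """Elide the oldest unpinned tool outputs until the transcript fits
    under the model's context limit."""
    trimmed = list(history)
    while sum(count_tokens(m.content) for m in trimmed) > limit:
        candidates = [i for i, m in enumerate(trimmed)
                      if m.role == "tool" and not m.pinned
                      and not m.content.startswith("[tool output elided")]
        if not candidates:
            break  # nothing safe left to elide; a real harness would summarize
        i = candidates[0]
        trimmed[i] = Message(
            "tool",
            f"[tool output elided, ~{count_tokens(trimmed[i].content)} tokens]")
    return trimmed
```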

Q: The definitions of good and bad are super important to the final result of any software. What is a good harness?

ER: This is actually a funny question, because there are so many different signals you can use in software development: everything from automated signals like whether or not the code you've written compiles, whether it lints properly, whether the tests pass. Is there an associated document that properly explains how the code works? There are hundreds of these signals you can evaluate. In software, you'd expect everybody to make use of all of them, but in reality most organizations have very little of this signal fully implemented in their codebases. We spend a lot of effort trying to identify these hundreds of different validation signals that software could have. Not only do our native evaluations have that, but when we help organizations deploy coding agents and software development agents, we help them understand those signals as well. If they're missing those signals in their own codebases, they can use Droids to actually bring those signals in. That improves the quality of our agent in particular, because our agents are optimized for those signals.
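
As a rough illustration of what identifying validation signals could look like, here is a small hypothetical sketch that checks a repository for a handful of common ones (lint config, type checking, tests, CI, a formatter). The file-name heuristics are generic conventions, not Factory's analysis.

```python
# Hypothetical sketch: check which common validation signals a repo exposes.
# The file-name heuristics are generic conventions, not Factory's analysis.

from pathlib import Path

SIGNALS = {
    "lint config":   ["ruff.toml", ".eslintrc.json", ".golangci.yml"],
    "type checking": ["mypy.ini", "pyrightconfig.json", "tsconfig.json"],
    "tests":         ["tests", "test", "__tests__"],
    "CI pipeline":   [".github/workflows", ".gitlab-ci.yml"],
    "formatter":     [".prettierrc", "rustfmt.toml", ".editorconfig"],
}

def audit_signals(repo: Path) -> dict[str, bool]:
    """Return which validation signals appear to be present in the repo."""
    return {name: any((repo / p).exists() for p in paths)
            for name, paths in SIGNALS.items()}

if __name__ == "__main__":
    for signal, present in audit_signals(Path(".")).items():
        print(f"{'present' if present else 'MISSING'}  {signal}")
```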

Q: So it's not just the agent, it's the tooling around the agent, right? A software developer has a ton of tooling that gets them from writing code to production. What's the sort of tooling that an AI coding agent needs?

ER: Any developer or organization can take advantage of linters and static type checkers. Of course, end-to-end and unit tests. There are auto-formatters you can bring in, and SAST tools (static application security testers and scanners): your Snyks of the world. Anything that runs an audit of the code and says green or red, good or bad. Maybe it gives you a score, like code complexity. GitHub just put out a really cool code quality analyzer. Agents can use all of these to improve the quality of their own work. When you let an agent loose on code and you don't want a human to get involved, it needs to get that signal from something. Our view is that autonomy is going to come from bringing in more and more of those signals automatically, rather than from humans.
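
To show how an agent could consume these tools as feedback rather than waiting for a human reviewer, here is a minimal hypothetical sketch that runs a few common checks and reduces each one to a green/red signal. Which tools a given repo actually has installed is an assumption, not a given.

```python
# Hypothetical sketch: run a few audit tools and reduce each to green/red,
# the kind of signal an agent can act on without a human in the loop.
# The choice of tools is an assumption about what the repo has installed.

import subprocess

CHECKS = {
    "format":     ["ruff", "format", "--check", "."],
    "lint":       ["ruff", "check", "."],
    "types":      ["mypy", "."],
    "unit tests": ["pytest", "-q"],
}

def run_checks() -> dict[str, bool]:
    results = {}
    for name, cmd in CHECKS.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            results[name] = proc.returncode == 0
        except FileNotFoundError:
            results[name] = False  # a missing tool is itself a missing signal
    return results

if __name__ == "__main__":
    for name, ok in run_checks().items():
        print(f"{'green' if ok else 'red'}  {name}")
```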

Q: Instrumenting that sort of observability and signal production is important for better agents. You said agents can help with that. How does that work?

ER: Most humans are largely capable of getting by without a lot of this stuff. If you don't know why code should be formatted a certain way, the senior staff engineer who's sitting next to you will review your code and say, hey, you did this wrong. That's fine with humans. But if you really want to scale up, bringing on agents isn't like hiring another person; it's like hiring a hundred intern-level engineers. You can't code review a hundred engineers, right? You need something else. If Droid, our autonomous coding agent, runs this autonomy maturity analysis and finds all of these signals that don't work, you can say to Droid: fix those six missing signals. As a developer, that's really where your opinions should go: how should we use this linter or that formatter? Once you've decided on that as a team, the feedback loop starts to accelerate.

Q: A lot of senior developers talk about code smells. Is it possible to systematize code smells automatically? Or does somebody need to go in and say "That's a bad practice"?

ER: That's a great question. A fully autonomous codebase isn't just automations that rely on static checks; you can also bring AI automations into your software development lifecycle. For us, Droids conform to different workflows, like code review, incident response, and documentation, and become automations. You can actually plug a Droid into your GitHub Actions pipeline, and it becomes a code review tool. If you want to check for specific code smells, you have a fully customizable code review Droid. Similarly with your documentation and your incident response: you just plug the Droid into any cron job, any VM you have. You could even run it on a loop on a laptop if you really wanted to. For the code smell aspect, I think people will bring in AI agents to handle the fuzzy practices that can't be determined statically.
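
As a structural sketch of what plugging an agent into a CI pipeline might look like (not Factory's actual integration), here is a hypothetical review step in Python: it collects the pull request's diff and hands it, along with the team's code-smell checklist, to a placeholder review function.

```python
# Hypothetical sketch of a CI code-review step: gather the PR diff and pass
# it to an agent along with the team's code-smell checklist.
# `review_with_agent` is a placeholder, not Factory's API.

import subprocess

CODE_SMELL_CHECKLIST = [
    "Functions longer than ~50 lines",
    "Duplicated logic that should be extracted",
    "Catch-all exception handlers that swallow errors",
]

def pr_diff(base: str = "origin/main") -> str:
    """Diff of the current branch against the base branch."""
    return subprocess.run(["git", "diff", f"{base}...HEAD"],
                          capture_output=True, text=True, check=True).stdout

def review_with_agent(diff: str, checklist: list[str]) -> str:
    # Placeholder: a real pipeline would call your coding agent here with
    # the diff and checklist and return its review comments.
    raise NotImplementedError

if __name__ == "__main__":
    print(review_with_agent(pr_diff(), CODE_SMELL_CHECKLIST))
```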

Q: Some folks have been worrying about AI agents causing a rise in work slop: low-quality code that needs to be fixed manually. How do you make agents that don't contribute to that problem and are instead a net benefit?

ER: There's a great body of research out of Stanford that looks at the impact of AI on codebases and productivity. They looked at all these different signals: the volume of code being generated by AI, adoption, the density of power users, like maybe this org has a higher density of people who use it a ton. Then they looked at the baseline quality of the codebase and tried to determine which one of these predicts whether AI will accelerate or decelerate a company. Because AI is actually decelerating some companies, right? It turns out that none of the things around volume of code, number of coding agents, or coding agent penetration correlated at all with productivity. The only signal was code quality. The higher the quality of the codebase, the more AI accelerated the organization. It's very intuitive, right? What is AI? A great pattern recognizer. So great code in means great code out. We see a lot of organizations decelerating because they have bad standards and bad-quality code when they bring AI agents in. We can give you a fantastic agent, but we also give you the tooling to determine whether you have a bad-quality codebase or a good one.

Q: How are AI agents changing the nature of work?

ER: As general software development agents get better and better at many tasks, you start to see the whole world as a software task. What is building a PowerPoint for your sales team but a software task? What is doing customer research, or answering customer questions about really complex documentation? It's a software task. We're bullish on the notion that the best general agents are just the best software development agents. Increasingly, it's not just pure software engineers but product managers, data scientists, and even folks who sell software who are starting to realize the capability of software development agents, which has been a super interesting trend to see.
