A new benchmark evaluates how well AI agents can maintain codebases over months of real-world development, moving beyond simple bug fixes to continuous integration challenges.
A team of researchers has introduced SWE-CI, a new benchmark designed to evaluate how well AI agents can maintain codebases over extended periods of real-world development. Unlike existing benchmarks that test agents on isolated bug fixes, SWE-CI simulates the continuous evolution of software through the lens of Continuous Integration workflows.
The benchmark, detailed in a paper published on arXiv, addresses a critical gap in current AI agent evaluation. While tools like SWE-bench have demonstrated that large language model-powered agents can handle static bug-fixing tasks, such benchmarks fail to capture the complexity of maintaining mature software over time. Real-world development involves complex requirement changes, feature iterations, and the need to sustain code quality across dozens of commits spanning months.
SWE-CI comprises 100 tasks derived from real-world code repositories, with each task representing an evolution history averaging 233 days and 71 consecutive commits. The benchmark requires agents to systematically resolve these tasks through multiple rounds of analysis and coding iterations, mimicking the iterative nature of actual software development.
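The paper does not publish a harness API, but the multi-round resolution loop described above can be sketched roughly as follows. All names here (`Commit`, `Task`, `evaluate`, `max_rounds`) are hypothetical illustrations, not the benchmark's actual interface:

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One step in a task's evolution history (hypothetical structure)."""
    message: str
    passes_ci: bool = False  # whether the agent's current patch passes CI checks


@dataclass
class Task:
    """A SWE-CI-style task: a sequence of consecutive commits to resolve."""
    repo: str
    commits: list[Commit] = field(default_factory=list)


def evaluate(task: Task, max_rounds: int = 3) -> float:
    """Return the fraction of commits resolved within the per-commit round budget."""
    resolved = 0
    for commit in task.commits:
        for _ in range(max_rounds):
            # Placeholder for an agent's analyze-and-patch iteration;
            # this stub simply marks the commit resolved on the first round.
            commit.passes_ci = True
            if commit.passes_ci:
                resolved += 1
                break
    return resolved / len(task.commits)


task = Task("example/repo", [Commit("fix parser"), Commit("add feature flag")])
print(evaluate(task))  # 1.0 with the stub agent above
```

The key point the sketch illustrates is that scoring is sequential and iterative: each commit in the evolution history gets a bounded number of analysis-and-coding rounds, rather than a single one-shot patch attempt.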
The researchers argue that this approach shifts the evaluation paradigm from static, short-term functional correctness to dynamic, long-term maintainability, yielding insight into how well agents cope with the realities of ongoing software maintenance rather than one-shot patching.
This development comes at a time when AI agents are increasingly being deployed in software engineering workflows. Understanding their limitations and capabilities in maintaining complex codebases is crucial for organizations considering their adoption. The benchmark could serve as a standard for evaluating and improving agent-based development tools, potentially accelerating the maturation of AI-assisted software engineering.
For developers and researchers working on AI agents, SWE-CI offers a more realistic testing ground that better reflects the challenges of real-world software maintenance. The full paper, with detailed methodology and evaluation results, is available on arXiv.
The research team includes Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, and Bing Zhao, and the work spans multiple disciplines including Software Engineering, Artificial Intelligence, and Computation and Language. Their contribution represents a significant step toward more practical evaluation of AI agents in software development contexts.
