Ai2's SERA: Open Coding Agents That Adapt to Any Repository
#AI

AI & ML Reporter
4 min read

Ai2 has released SERA (Soft-verified Efficient Repository Agents), a family of open-source coding models that can be fine-tuned to private codebases at dramatically lower cost than existing approaches. The 32B parameter model achieves 54.2% on SWE-Bench Verified while requiring only 40 GPU days to train, with inference speeds up to 8,600 tokens/second on Blackwell GPUs.

The release includes models ranging from 8B to 32B parameters, all built around a training recipe that sharply cuts the cost and complexity of adapting coding agents to private codebases. The flagship SERA-32B delivers the headline 54.2% SWE-Bench Verified score after just 40 GPU days of training.

The Problem with Closed Coding Agents

Traditional coding agents face a fundamental limitation: they're trained on public code and struggle with private repositories containing custom data pipelines, internal APIs, and organizational conventions. While training on private data would solve this, generating synthetic training data from private codebases has been prohibitively expensive and technically challenging.

Ai2's approach changes this equation entirely. Their method can reproduce the performance of the previous best open-source model for approximately $400 in compute costs, or reach industry-competitive levels for up to $12,000—a fraction of what similar systems typically require.

Two Key Innovations

Soft-verified generation (SVG) represents the first major breakthrough. Traditional synthetic data generation requires carefully testing examples to ensure correctness, demanding complex infrastructure and precise example generation. SVG eliminates this requirement by generating patches that are only partially correct. The insight: patches don't need to be fully correct to be helpful for coding agents. Just as different code can lead to the same correct solution, partially correct patches enable learning how to transform incorrect code into correct code.
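
To make the idea concrete, here is a minimal sketch of a soft-verified generation loop. It is an illustration, not Ai2's actual pipeline: the toy bug injection, the `Trajectory` container, and the parse-only acceptance check are all invented stand-ins, and the real system would use richer heuristics.

```python
# Illustrative sketch of soft-verified generation (SVG); not Ai2's actual pipeline.
# The bug injection and the "soft" acceptance check are deliberately simple stand-ins.
import ast
from dataclasses import dataclass

@dataclass
class Trajectory:
    buggy_code: str
    patched_code: str   # whatever the teacher model produced

def soft_check(patched_code: str) -> bool:
    """Cheap acceptance test: the patch only has to parse, not pass the
    repository's test suite. Real soft verification would use stronger signals."""
    try:
        ast.parse(patched_code)
        return True
    except SyntaxError:
        return False

def make_trajectory(clean_fn: str, bug_prompt: str, teacher) -> Trajectory | None:
    """Turn one working function into one synthetic repair trajectory."""
    buggy = clean_fn.replace("==", "!=", 1)   # toy bug injection guided by `bug_prompt`
    patched = teacher(buggy, bug_prompt)      # teacher model proposes a fix
    traj = Trajectory(buggy_code=buggy, patched_code=patched)
    # Keep the example even if the patch is only partially correct;
    # discard only trajectories that fail the cheap check.
    return traj if soft_check(traj.patched_code) else None
```

The key difference from fully verified pipelines is in the acceptance step: nothing is built, installed, or executed against the repository's tests before a trajectory is kept.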

Scaling with a bug-type menu addresses data diversity. Rather than bottlenecking on finding real bugs, SERA draws from a taxonomy of 51 common bug patterns identified in prior analyses. For each function in a repository, the system can generate multiple distinct bug-style prompts, turning a repository with thousands of functions into tens of thousands of varied agentic trajectories at low cost.
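
The scaling arithmetic is easiest to see in code. The sketch below crosses a repository's functions with a bug-type menu to produce prompts; the three example bug types and the prompt template are invented for illustration and are not entries from SERA's 51-item taxonomy.

```python
from itertools import product

# Invented examples standing in for entries of a 51-item bug taxonomy.
BUG_MENU = [
    "off-by-one error in a loop bound",
    "swapped argument order in a call",
    "missing None check before attribute access",
]

def bug_prompts(functions: list[str], menu: list[str] = BUG_MENU):
    """Cross every repository function with every bug type: a repository with
    thousands of functions yields tens of thousands of distinct prompts."""
    for fn_name, bug in product(functions, menu):
        yield f"Introduce a {bug} into `{fn_name}`, then repair it."

# 5,000 functions x 51 bug types would give 255,000 candidate prompts;
# the 3-item toy menu above gives 15,000.
prompts = list(bug_prompts([f"pkg.module.func_{i}" for i in range(5000)]))
print(len(prompts))  # 15000
```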

Performance That Rivals Industry Leaders

At 32K context length, SERA-32B achieves 49.5% ± 1.9% on SWE-Bench Verified, comparable to Devstral Small 2 (50.0% ± 1.3%) and GLM-4.5-Air (50.5% ± 1.3%). At 64K context, SERA reaches 54.2% ± 1.4%, competitive with longer-context baselines despite being trained only up to 32K tokens.

What makes these results particularly impressive is the efficiency comparison. SERA matches SWE-smith, a synthetic data method, at 57× lower cost and SkyRL, an open-source reinforcement learning system, at 26× lower cost.

Repository Specialization Shows Promise

The most compelling evidence comes from repository-specific specialization tests on Django, SymPy, and Sphinx—the three largest repositories in SWE-Bench. When trained on just 8,000 synthetic trajectories per repository, SERA-32B models consistently match and often exceed the performance of 100B+ parameter models used as teachers.

On Django, the specialized model achieves 52.23% compared to GLM-4.5-Air's 51.20%. On SymPy, it reaches 51.11% versus 48.89%. These gains are most pronounced on Django and SymPy, which together account for over 60% of all SWE-Bench problems.

Optimized for Real-World Use

Ai2 collaborated with NVIDIA to optimize SERA inference for accelerated infrastructure. Early benchmarks show promising results: running in BF16 precision on 4xH100 GPUs, SERA achieves approximately 1,950 peak output tokens per second with a 16K context window. At FP8 precision, throughput increases to 3,700 tokens/second with negligible accuracy drop. On next-generation Blackwell 4xB200 systems running in NVFP4, SERA scales to around 8,600 peak output tokens per second.
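
As a rough illustration of that deployment path (generic vLLM usage, not an official recipe from Ai2 or NVIDIA), the sketch below serves a model at FP8 across four GPUs. The model identifier is a placeholder, and FP8 support depends on your vLLM version and hardware.

```python
from vllm import LLM, SamplingParams

# Placeholder model id; substitute the actual SERA checkpoint you downloaded.
MODEL = "path/or/hub-id/of/sera-32b"

llm = LLM(
    model=MODEL,
    quantization="fp8",        # small accuracy drop for roughly double the throughput
    tensor_parallel_size=4,    # e.g. 4xH100
    max_model_len=16384,       # matches the 16K-context benchmark setting above
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Fix the failing test in utils/date_parse.py:"], params)
print(outputs[0].outputs[0].text)
```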

The models work with Claude Code out of the box, and they can be fine-tuned to specialize to your own codebase, including your full engineering stack and conventions, quickly and at low cost.

Complete Open Source Release

Every component of this release is open: models, Claude Code integration, and training recipes can be launched with a single line of code. The training pipeline is intentionally simple—standard supervised fine-tuning on trajectories with no custom RL infrastructure needed.
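
As a hedged sketch of what "standard supervised fine-tuning on trajectories" can look like in practice, the snippet below uses the generic Hugging Face TRL trainer rather than Ai2's released recipe. It assumes the synthetic trajectories have been flattened into a JSONL file with a `text` field; the file name and base-model id are placeholders.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder data file: one flattened trajectory per line under a "text" key.
dataset = load_dataset("json", data_files="sera_trajectories.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",   # stand-in base model, not SERA's
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sera-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```

The point of the example is the absence of anything exotic: no reward models, no rollout infrastructure, just a dataset of trajectories and a stock SFT job.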

Ai2 is also releasing state-of-the-art training data so researchers can inspect what worked and push it further, avoiding the many stumbling blocks typical of coding agent development. The total cost to reproduce performance levels of the best previous open-source result is roughly $400 on commodity cloud GPUs, more than 25 times cheaper than many existing approaches.

What This Means for Developers

The implications are significant. Small to mid-sized businesses and independent developers can now train coding agents on their private codebases without massive infrastructure investments. Instead of designing complicated RL pipelines and test harnesses for every new task setting, you generate targeted synthetic data and run a straightforward SFT job.

As Ai2 notes, SERA was built largely by a single researcher, demonstrating how accessible this technology has become. The combination of low cost, high performance, and complete openness means agentic coding can become a widely accessible practice rather than the domain of a handful of well-funded labs.

Whether you're running locally on your hardware, deploying in the cloud, or fine-tuning on your own codebase, SERA delivers practical agentic coding within reach of developers, researchers, and small teams alike.

Models | Tech Report | SERA CLI | CLI on PyPI
