Leadership in AI-Assisted Engineering - InfoQ

Rust Reporter
5 min read

Justin Reock, Deputy CTO of DX, breaks down the real-world impact of AI on engineering teams using DORA, SPACE, and DX research, addressing the 95% GenAI pilot failure rate and offering leaders actionable strategies to measure ROI, reduce developer fear, and integrate AI across the SDLC without compromising code safety or throughput.

As a Rustacean, I value two things above all else in engineering: safety and performance. These priorities guide how I write code, and they apply equally to how engineering leaders should approach AI-assisted development. Justin Reock, Deputy CTO of DX, laid out exactly this case in his recent QCon AI presentation, "Leadership in AI-Assisted Engineering," backed by hard data from DORA, SPACE, and DX's own research. The talk cuts through the hype around AI's impact on engineering, addressing the "GenAI Divide" where 95% of pilots fail, and offers actionable strategies to measure ROI, reduce developer fear, and integrate AI without sacrificing code safety or team throughput.

The current discourse on AI's productivity impact is noisy. Google reports 10% productivity gains for engineers using AI, but the METR study found a 19% decrease, even as every engineer in the study reported feeling more productive. This gap between perception and reality is why measurement matters, a core theme Reock emphasizes. We cannot rely on qualitative self-reports; we need quantitative metrics tied to safety and performance.

The DORA community's recent report illustrates this need. Their Bayesian posterior distribution of AI impact shows that clear AI policies and dedicated time to learn have the strongest positive effects on team outcomes. DORA found that a 25% increase in AI adoption correlates to a 7.5% improvement in documentation quality, 3.4% better code quality, and 3.1% faster code reviews. However, per-company data shows high variability: some organizations see 20% gains in change confidence, while others see 20% losses. For Rust developers, this mirrors the tradeoff between the raw speed of unsafe code and the assurance of safe, reviewed code. Skipping reviews to move faster increases change failure rate, just as over-relying on AI without checks does.
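To make those coefficients concrete, here is a minimal Rust sketch that scales DORA's reported per-25-point correlations to an arbitrary adoption increase. The linear scaling is an assumption for illustration only; DORA's analysis does not promise linearity, and the per-company variability above is exactly why these numbers should be validated against your own data.

```rust
// Illustrative only: scales DORA's reported correlations linearly,
// which the underlying Bayesian analysis does not guarantee.
// Each coefficient is the reported % improvement per 25-percentage-
// point increase in AI adoption.
const DOC_QUALITY_PER_25PT: f64 = 7.5;
const CODE_QUALITY_PER_25PT: f64 = 3.4;
const REVIEW_SPEED_PER_25PT: f64 = 3.1;

/// Estimate expected percentage improvements for a given increase
/// in AI adoption, expressed in percentage points.
fn estimated_impact(adoption_delta_pts: f64) -> (f64, f64, f64) {
    let scale = adoption_delta_pts / 25.0;
    (
        DOC_QUALITY_PER_25PT * scale,
        CODE_QUALITY_PER_25PT * scale,
        REVIEW_SPEED_PER_25PT * scale,
    )
}

fn main() {
    let (docs, code, reviews) = estimated_impact(25.0);
    println!("docs +{docs:.1}%, code quality +{code:.1}%, reviews +{reviews:.1}% faster");
}
```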

To capture this complexity, Reock recommends the SPACE framework, which measures Satisfaction, Performance, Activity, Communication, and Efficiency. Over-indexing on a single metric like AI utilization leads to Goodhart's Law, where teams game the system. For example, mandating 100% AI utilization might lead engineers to update READMEs just to hit the target, with no real productivity gain. Safety comes from balancing speed with code maintainability; performance comes from tracking both quantitative (PR throughput) and qualitative (developer satisfaction) metrics. DX's own data shows moderate AI users have 2.6% higher change confidence, 2.2% better code maintainability, and 0.11% lower change failure rate than non-users. This aligns with Rust's ownership model. Clear rules, like system prompts for AI, prevent unintended side effects, just as ownership prevents memory safety issues.
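As a sketch of what an oppositional, SPACE-style check might look like, the hypothetical Rust snapshot below pairs an activity metric with a quality metric. The field choices and the Goodhart check are illustrative assumptions, not DX's actual schema.

```rust
// Hypothetical SPACE-style snapshot; field names are illustrative,
// not DX's actual schema.
#[derive(Debug)]
struct SpaceSnapshot {
    satisfaction: f64,   // developer survey score, 0.0..=1.0
    performance: f64,    // e.g. change confidence, 0.0..=1.0
    activity: u32,       // e.g. merged PRs this period
    communication: f64,  // e.g. review turnaround in hours
    efficiency: f64,     // e.g. share of uninterrupted focus time
}

/// Oppositional check: rising output alongside falling change
/// confidence is the Goodhart pattern that a single metric such as
/// "AI utilization" would miss.
fn goodhart_warning(prev: &SpaceSnapshot, curr: &SpaceSnapshot) -> bool {
    curr.activity > prev.activity && curr.performance < prev.performance
}

fn main() {
    let q1 = SpaceSnapshot {
        satisfaction: 0.72, performance: 0.81, activity: 140,
        communication: 6.5, efficiency: 0.40,
    };
    let q2 = SpaceSnapshot {
        satisfaction: 0.70, performance: 0.74, activity: 180,
        communication: 5.0, efficiency: 0.42,
    };
    println!("latest snapshot: {q2:?}");
    println!("goodhart warning: {}", goodhart_warning(&q1, &q2));
}
```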

A key challenge Reock highlights is the GenAI Divide. An MIT NANDA group study found that 95% of AI pilots fail, mostly because organizations deploy general-purpose LLMs instead of task-specific agentic solutions; general-purpose models have far lower success rates in embedded workflows. For performance, this means leaders should prioritize agents for specific SDLC bottlenecks, such as legacy code reverse engineering, over broad, mandatory LLM adoption. Safety is preserved by avoiding generic models for critical code generation.

To measure AI impact effectively, Reock presents the DX AI Measurement Framework, built on the Core 4 (a distillation of DORA, SPACE, and DevEx). The framework tracks three dimensions: Utilization, Impact, and Cost. Utilization metrics (daily active users, percentage of AI-assisted PRs) only tell you who is using the tool, not whether it works. Impact metrics (PR throughput, code maintainability, change confidence) measure safety and performance. Cost metrics prevent runaway token spending. Microsoft, for example, tracks "bad developer days" (a telemetry-based metric of developer frustration) correlated with AI use. Dropbox found that 23% of committed code is now AI-generated, with corresponding throughput gains. This framework avoids the Rust equivalent of measuring lines of code written, a useless metric, and instead tracks meaningful safety and performance outcomes.
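A compact way to picture the framework's shape: the hypothetical Rust types below group example metrics from the talk under the three dimensions. The field names are assumptions for illustration, not an official DX schema.

```rust
// Hypothetical grouping of example metrics under the framework's
// three dimensions; field names are illustrative, not an official
// DX schema.
struct Utilization {
    daily_active_users: u32, // who is using the tool...
    ai_assisted_pr_pct: f64, // ...but not whether it works
}

struct Impact {
    pr_throughput: f64,     // delivery speed signal
    maintainability: f64,   // quality signal
    change_confidence: f64, // safety signal
}

struct Cost {
    monthly_token_spend_usd: f64, // guards against runaway spend
}

struct AiMeasurement {
    utilization: Utilization,
    impact: Impact,
    cost: Cost,
}

fn main() {
    let m = AiMeasurement {
        utilization: Utilization { daily_active_users: 420, ai_assisted_pr_pct: 23.0 },
        impact: Impact { pr_throughput: 4.2, maintainability: 0.78, change_confidence: 0.81 },
        cost: Cost { monthly_token_spend_usd: 18_500.0 },
    };
    // Report impact next to utilization: adoption numbers alone
    // cannot show whether the tooling is working.
    println!(
        "{} DAUs, {:.0}% AI-assisted PRs, throughput {:.1} PRs/wk, confidence {:.2}, ${:.0}/mo",
        m.utilization.daily_active_users,
        m.utilization.ai_assisted_pr_pct,
        m.impact.pr_throughput,
        m.impact.change_confidence,
        m.cost.monthly_token_spend_usd,
    );
}
```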

Technical controls are equally important for safety. System prompts (called Cursor Rules or Agent Markdown in some tools) enforce organizational coding standards. A system prompt requiring Spring Boot 3.0+ eliminates deprecated method errors in AI outputs. Temperature settings control output randomness. Low temperature (0.0001) gives near-deterministic code output, critical for strict code generation. High temperature (0.9) enables creative brainstorming. For Rust developers, this is similar to clippy lints or rustfmt. Automated rules enforce safety and consistency across the codebase.
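As one concrete shape these controls can take, here is a minimal sketch in Rust (using the serde_json crate) that assembles a request body for an OpenAI-compatible chat API: a standards-enforcing system prompt plus a near-zero temperature for deterministic code generation. The model name is a placeholder, not any specific vendor's value.

```rust
// Minimal sketch assuming an OpenAI-compatible chat API and the
// serde_json crate; the model name below is a placeholder.
use serde_json::json;

fn main() {
    // System prompt enforcing an organizational standard, in the
    // spirit of Reock's Spring Boot example.
    let system_prompt = "You generate code for our services. \
                         Always target Spring Boot 3.0 or later and \
                         never emit deprecated APIs.";

    // Near-zero temperature for near-deterministic code output;
    // raise it (e.g. 0.9) for brainstorming sessions instead.
    let request = json!({
        "model": "example-code-model", // placeholder, not a real model ID
        "temperature": 0.0001,
        "messages": [
            { "role": "system", "content": system_prompt },
            { "role": "user", "content": "Write a health-check endpoint." }
        ]
    });

    // A real integration would POST this body to the vendor's
    // chat-completions endpoint; here we just print it.
    println!("{}", serde_json::to_string_pretty(&request).unwrap());
}
```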

Cultural factors are just as critical. Reock stresses framing AI as augmentation, not replacement. Google's Project Aristotle found psychological safety is the top predictor of team performance. Fear of replacement leads to hidden errors, gamed metrics, and reduced collaboration. Reock noted that SWE-bench shows top models completing only 44% of tasks without human intervention, so replacement is not feasible; performance gains come from augmentation. Zapier increased hiring because AI boosts individual engineer throughput 10-15%, and cut standups from five per week to two. This aligns with Rust's goal of empowering developers: the borrow checker does not replace the developer, it augments them to write safe code faster.

Most organizations focus AI adoption on code generation, but Reock's study of 135,000 engineers found this is rarely the bottleneck. Engineers spend only 5-6 hours weekly writing code. Bigger time sinks are context switching, meeting load, and legacy code reverse engineering. Integrating AI across the entire SDLC delivers far larger gains. Morgan Stanley's DevGen.AI saves 300,000 annual hours by reverse engineering legacy COBOL and Perl specs. Spotify's incident management agent handles 90% of incidents by gathering runbook context instantly. Faire automates 3,000 weekly low-sophistication code reviews, catching errors faster. For Rust teams, this means applying the same safety-first mindset to AI adoption as we do to code. Measure, enforce rules, prioritize throughput over cost-cutting.

Reock's next steps for leaders are clear. Distribute DX's 65-page AI guide, which compiles best practices from engineers saving at least one hour weekly with AI. Use oppositional metrics that balance speed and quality, rather than hyper-focusing on a single dimension. Iterate on use cases based on data, not hype. The path to successful AI adoption is not magic tools, but rigorous measurement, clear rules, and trust in your team. As Rustaceans know, safety and performance are earned, not given.
