AI Agents Show Promise but Fall Short in Binary Backdoor Detection Benchmark

New research from Quesma shows that AI agents can detect some backdoors in binary executables using reverse engineering tools, with top-scoring Claude Opus 4.6 reaching a 49% detection rate. High false positive rates and critical misses, however, show the technology isn't production-ready.

The Binary Analysis Challenge

Security researchers have created the BinaryAudit benchmark to evaluate AI agents' ability to detect malicious code in compiled executables without source access. The benchmark tests models like Claude Opus 4.6, Gemini 3 Pro, and Claude Opus 4.5 against deliberately backdoored versions of open-source projects including lighttpd (C web server), dnsmasq (DNS/DHCP server), and Sozu (Rust load balancer).

Reverse engineering binaries presents unique challenges:

Loss of structure: Compilation strips high-level abstractions like function names
Optimization artifacts: Compiler optimizations produce instruction sequences that are hard to map back to the original logic
Tool limitations: Open-source decompilers (Ghidra, Radare2) lag behind commercial alternatives

Performance Landscape

The benchmark reveals significant capability gaps:

Model	Detection Rate	False Positive Rate
Claude Opus 4.6	49%	High
Gemini 3 Pro	44%	High
Claude Opus 4.5	37%	High

BinaryAudit Model Rankings showing Claude Opus 4.6 leading at 49% pass rate

When it works: Claude successfully detected an HTTP header-triggered backdoor in lighttpd by:

Identifying suspicious imported functions (popen)
Tracing cross-references to the malicious function (li_check_debug_header)
Analyzing decompiled pseudocode to confirm command execution
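
The first two steps above (spot a risky import, then follow its cross-references) can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual procedure: the watchlist of command-execution functions and the helper names `flag_suspicious_imports` and `functions_to_review` are assumptions made for this example, and the import and cross-reference data are typed in by hand rather than exported from Ghidra or Radare2.

```python
# Hypothetical watchlist of libc calls that can run shell commands or spawn
# processes. This set is an illustrative assumption, not the benchmark's list.
COMMAND_EXECUTION_IMPORTS = {"popen", "system", "execl", "execve", "execvp"}


def flag_suspicious_imports(imported_symbols):
    """Return the imported symbols that appear on the watchlist, sorted."""
    normalized = {name.strip() for name in imported_symbols}
    return sorted(normalized & COMMAND_EXECUTION_IMPORTS)


def functions_to_review(cross_references, flagged):
    """Map each flagged import to the functions that call it.

    cross_references: dict of import name -> list of calling function names,
    in the shape one might export from a disassembler's cross-reference view.
    """
    return {name: sorted(cross_references.get(name, [])) for name in flagged}


if __name__ == "__main__":
    # Hand-written sample data; only popen and li_check_debug_header
    # come from the article.
    imports = ["read", "write", "popen", "strlen", "accept"]
    xrefs = {"popen": ["li_check_debug_header"]}

    flagged = flag_suspicious_imports(imports)
    print("flagged imports:", flagged)
    print("review queue:", functions_to_review(xrefs, flagged))
```

A hit from a filter like this is only a starting point. A binary may import these functions for legitimate reasons, so the third step, reading the decompiled caller, is what confirms or rules out a backdoor.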

Critical failures: Even top models missed obvious backdoors, such as one in dnsmasq, where:

A DHCP option 224 triggered command execution via execl("/bin/sh", ...)
Claude Opus 4.6 correctly identified the function but dismissed it as "legitimate DHCP script execution"

Fundamental Limitations

False positives: 28% of clean binaries triggered false alarms. Gemini 3 Pro, for example, hallucinated a non-existent backdoor in command-line parsing. This illustrates the false positive paradox: because most software is clean, even a moderate false positive rate means false alarms can far outnumber real detections.

Model ranking on false positive rate showing how often models incorrectly report backdoors in clean binaries
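
A short calculation makes the false positive paradox concrete. The sketch below applies Bayes' rule to estimate what share of alerts would point at a real backdoor. The 49% detection rate and 28% false positive rate are taken from the article, though pairing them in one calculation is an assumption, since the article does not say which model the 28% figure belongs to. The prevalence values (how many binaries are actually backdoored) are pure assumptions chosen for illustration, and `alert_precision` is a name invented for this example.

```python
def alert_precision(detection_rate, false_positive_rate, prevalence):
    """Fraction of raised alerts that point at a real backdoor (Bayes' rule)."""
    true_alerts = detection_rate * prevalence
    false_alerts = false_positive_rate * (1.0 - prevalence)
    total = true_alerts + false_alerts
    return true_alerts / total if total else 0.0


if __name__ == "__main__":
    # 0.49 and 0.28 come from the article; the prevalence values are assumed.
    for prevalence in (0.5, 0.01, 0.001):
        precision = alert_precision(0.49, 0.28, prevalence)
        print(f"prevalence={prevalence:<6} precision={precision:.4f}")
```

Under the assumed 1% prevalence, fewer than 2 in 100 alerts would correspond to a real backdoor, which is why a false positive rate that looks tolerable on a balanced benchmark becomes a flood of noise in practice.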

Tooling gaps: Analysis of Go binaries proved impossible with open-source tools. Ghidra took 40+ minutes to load a 50MB Caddy executable before failing, while commercial IDA Pro succeeded in 5 minutes. This necessitated excluding Go from the benchmark.

Needle-in-haystack problem: Agents lack strategic focus, often auditing benign libraries while missing actual backdoors buried among thousands of functions. As Dragon Sector's Michał Kowalczyk notes: "Binary analysis requires understanding which code paths handle untrusted input - current models lack this intuition."
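
One way to picture the missing "strategic focus" is as a prioritization problem: review functions close to untrusted input first. The sketch below runs a breadth-first search over a toy call graph, starting from functions assumed to handle untrusted input. Everything here is hypothetical, including the function names and the graph itself. It illustrates the idea in the quote and is not a technique the benchmark or any of the tested models is known to use.

```python
from collections import deque


def rank_by_input_distance(call_graph, input_handlers):
    """Rank functions by call distance from untrusted-input handlers.

    call_graph: dict of function name -> list of functions it calls.
    input_handlers: functions assumed to receive untrusted input.
    Returns (function, distance) pairs, nearest first. Functions that are
    not reachable from any input handler are omitted.
    """
    distance = {name: 0 for name in input_handlers}
    queue = deque(sorted(input_handlers))
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, []):
            if callee not in distance:
                distance[callee] = distance[current] + 1
                queue.append(callee)
    return sorted(distance.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Toy call graph with invented names, for illustration only.
    graph = {
        "handle_request": ["parse_headers", "send_response"],
        "parse_headers": ["check_debug_header"],
        "rotate_logs": ["format_timestamp"],
    }
    for name, hops in rank_by_input_distance(graph, {"handle_request"}):
        print(hops, name)
```

In this toy graph, `rotate_logs` and `format_timestamp` never appear in the ranking because nothing on the input path calls them, which is the kind of benign code an unfocused audit would spend time on.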

Practical Implications

While not production-ready, the technology lowers barriers to binary analysis:

Developers without reverse engineering expertise can perform initial audits
Models now reliably operate complex tools like Ghidra (a year ago they couldn't)
Potential applications extend to hardware reverse engineering and cross-architecture porting

The BinaryAudit benchmark provides full task details and methodologies. Future improvements may come from context engineering techniques and commercial tool integration, though local deployment will be essential for security-sensitive environments where cloud processing poses risks.

Michał 'Redford' Kowalczyk of Dragon Sector speaking at the Chaos Communication Congress about breaking DRM in Polish trains.

Image credits: All benchmark visuals from Quesma's BinaryAudit repository

