AI Can't Audit Your Binaries Yet

The best AI model finds 49% of backdoors in compiled binaries. With a 22% false positive rate. Here's what that means for your supply chain security strategy.

Security · AI · Engineering · Supply Chain · Architecture
February 22, 2026
6 min read

Quesma just published BinaryAudit, a benchmark for AI-driven backdoor detection in compiled binaries. They partnered with Michał Kowalczyk from Dragon Sector (the researcher who reverse-engineered malicious code in Polish trains) to hide backdoors in real network infrastructure software: web servers, DNS servers, SSH servers, proxies, load balancers. Then they pointed AI agents armed with Ghidra and Radare2 at those binaries and measured what happened.

The best model, Claude Opus 4.6, found 49% of the backdoors. It also flagged clean binaries as malicious 22% of the time. Cost: $286 per run. Time: 54 minutes.

Read those numbers again. The most capable model on the market catches fewer than half the backdoors and raises a false alarm on roughly one in five clean binaries.

Why this matters more than another benchmark

Supply chain attacks are accelerating. Shai Hulud 2.0 hit thousands of organizations last year, including Fortune 500 companies and government agencies. Notepad++ disclosed a state-sponsored binary hijack just weeks ago. Researchers found hidden radio hardware in Chinese solar inverters and security backdoors in electric bus firmware.

The attack surface is expanding faster than the industry's ability to audit it. The fantasy is that AI will close this gap. BinaryAudit shows that fantasy is premature.

The fundamental problem: binaries aren't code

When you compile a Go or Rust program, the high-level abstractions that make code readable (functions, types, modules, error handling patterns) get flattened into machine instructions. Variable names disappear. Control flow becomes a graph of jump addresses. A backdoor that's obvious in source code becomes a needle in a haystack of register operations.
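To see that flattening at a smaller scale, here's a loose analogy using Python's bytecode (the function and backdoor credential are invented for illustration). Even here, the branch structure collapses into compare-and-jump instructions; native machine code goes much further, since the names below disappear too:

```python
import dis

# Hypothetical auth check with a hardcoded backdoor: glaring in source.
def check_password(password: str) -> bool:
    if password == "s3cret-maint-2026":  # the backdoor credential
        return True
    return len(password) >= 12  # stand-in for the legitimate check

# The bytecode view: names survive at this level, but control flow is
# already a flat sequence of loads, compares, and jumps.
opnames = [instr.opname for instr in dis.Bytecode(check_password)]
print(opnames)
```

In compiled Go or Rust, an auditor doesn't even get the constant-versus-database distinction for free; both paths are just memory reads feeding a comparison.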

LLMs are trained on source code. They understand functions, classes, import statements. Binary analysis requires a different kind of reasoning: pattern recognition across raw instruction sequences, understanding calling conventions, tracking data flow through registers and stack frames. The models can use Ghidra's decompiler to reconstruct something resembling source code, but decompiled output is lossy and often misleading.

This is the gap that matters. The models aren't bad at security. They're operating in a domain that doesn't match their training distribution.

The false positive problem is worse than the miss rate

Missing 51% of backdoors is bad. But in a production security workflow, the 22% false positive rate might be more damaging.

Here's how this plays out in practice. You have 100 binaries in your supply chain. You run AI analysis on all of them. You get roughly 22 false alarms on clean software. Each one requires a human reverse engineer to verify. Good reverse engineers cost $200-400/hour and a thorough binary analysis takes days, not hours. You've just generated hundreds of thousands of dollars in verification work, most of which will confirm the software is fine.

Meanwhile, of the binaries that actually contain backdoors, the AI only flagged about half.

```mermaid
graph LR
    A[100 Binaries] --> B{AI Analysis}
    B -->|Flagged Malicious| C[~27 flagged]
    B -->|Passed Clean| D[~73 passed]
    C --> E[~5 true positives]
    C --> F[~22 false positives]
    D --> G[~68 actually clean]
    D --> H[~5 missed backdoors]
    style F fill:#ff6b6b
    style H fill:#ff6b6b
```

The numbers above assume roughly 10% of your binaries are compromised, which is generous for most supply chains. Adjust that down to 1-2% (more realistic) and the signal-to-noise ratio gets much worse.
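The arithmetic is worth running yourself. A quick sketch using the benchmark's best-model recall and false positive rate; the prevalence values are illustrative assumptions, not measurements:

```python
# Back-of-envelope outcomes of an AI binary audit at different base rates.
def audit_outcomes(n_binaries: int, prevalence: float,
                   recall: float = 0.49, fpr: float = 0.22):
    compromised = n_binaries * prevalence
    clean = n_binaries - compromised
    true_pos = compromised * recall      # backdoors actually caught
    false_pos = clean * fpr              # clean binaries flagged anyway
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision

for prev in (0.10, 0.02):
    tp, fp, prec = audit_outcomes(100, prev)
    print(f"prevalence {prev:.0%}: ~{tp:.0f} true hits, "
          f"~{fp:.0f} false alarms, precision {prec:.0%}")
```

At 2% prevalence, precision drops to roughly 4%: about 24 of every 25 flags are false alarms, each one consuming expensive human verification time.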

The cost curve tells you where this is heading

The benchmark includes a cost-vs-accuracy Pareto frontier. Gemini 3 Flash Preview hits 37% detection for $18. Gemini 3 Pro Preview gets 44% for $28. Claude Opus 4.6 reaches 49% for $286. GPT-5.2 manages only 18% but has a 0% false positive rate.

That GPT-5.2 result is interesting. 18% detection with zero false positives means it only flags things it's very confident about. For a triage layer, that's actually more useful than the model that finds more but cries wolf constantly.

This suggests the right architecture isn't "pick the best model." It's a pipeline:

  1. First pass (cheap, high-precision): GPT-5.2 or similar low-false-positive model flags obvious backdoors
  2. Second pass (expensive, high-recall): Claude Opus or Gemini Pro on binaries that passed the first filter
  3. Human verification: Reverse engineers focus on the flagged subset
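A sketch of that routing, with the model calls stubbed out. The class, stage names, and toy classifiers below are invented for illustration; real analysis backends would sit behind the two callables:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Verdict(Enum):
    FLAG = "flag"   # model believes the binary is malicious
    PASS = "pass"   # model found nothing

@dataclass
class TriagePipeline:
    """Two-stage triage: cheap high-precision pass first, expensive
    high-recall pass second, humans only on the flagged subset."""
    cheap_precise: Callable[[bytes], Verdict]  # e.g. a low-FP model
    deep_recall: Callable[[bytes], Verdict]    # e.g. a high-recall model
    for_humans: list = field(default_factory=list)

    def run(self, name: str, binary: bytes) -> str:
        if self.cheap_precise(binary) is Verdict.FLAG:
            self.for_humans.append(name)   # high-confidence: straight to RE
            return "stage1-flag"
        if self.deep_recall(binary) is Verdict.FLAG:
            self.for_humans.append(name)
            return "stage2-flag"
        return "passed"  # survived both filters; NOT a clean verdict

# Demo with stub classifiers keyed on magic bytes.
def stage1(b: bytes) -> Verdict:
    return Verdict.FLAG if b.startswith(b"\x99") else Verdict.PASS

def stage2(b: bytes) -> Verdict:
    return Verdict.FLAG if b"\x66" in b else Verdict.PASS

pipe = TriagePipeline(cheap_precise=stage1, deep_recall=stage2)
print(pipe.run("obvious.bin", b"\x99rest"))  # stage1-flag
print(pipe.run("subtle.bin", b"\x00\x66"))   # stage2-flag
print(pipe.run("clean.bin", b"\x00\x00"))    # passed
```

The `for_humans` queue is the budget lever: reverse engineers only ever see what both stages, between them, chose to flag.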

You're still missing backdoors. But you're spending your human expert budget on the highest-signal targets instead of drowning in false positives.

What I'd actually build today

If I were designing a binary supply chain audit system for a production environment, AI analysis would be one signal among many. Not the primary defense.

The stack I'd trust:

  • Reproducible builds as the foundation. If you can rebuild a binary from source and diff the output, you don't need to reverse-engineer anything. The Tor Project and F-Droid have proven this works at scale.
  • Binary diffing against known-good versions. Tools like BinDiff and Diaphora are deterministic and fast. They won't tell you what a change does, but they'll tell you something changed.
  • AI analysis as a triage layer on binaries that can't be reproduced or diffed. Use the low-false-positive model first.
  • Human reverse engineering only on high-value targets that survive automated triage.
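The reproducible-build check at the top of that stack reduces to a hash comparison. A minimal sketch, assuming you can rebuild the binary from source; the file names and demo bytes are placeholders:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 (binaries can be large)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def reproducibility_check(shipped: Path, rebuilt: Path) -> bool:
    """True when the vendor binary matches our from-source rebuild.
    A mismatch doesn't prove a backdoor, only that the expensive path
    (binary diffing, AI triage, humans) is now required."""
    return sha256_of(shipped) == sha256_of(rebuilt)

# Demo with throwaway files standing in for a vendor binary and a rebuild.
work = Path(tempfile.mkdtemp())
(work / "vendor.bin").write_bytes(b"\x7fELF...payload")
(work / "rebuild.bin").write_bytes(b"\x7fELF...payload")
matches = reproducibility_check(work / "vendor.bin", work / "rebuild.bin")
print("reproducible:", matches)
```

In practice, bit-for-bit reproducibility takes build-system work (pinned toolchains, stripped timestamps and paths), but every binary that passes this check is one you never have to reverse-engineer.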

The critical insight: AI binary analysis is a filter, not a verdict. Treat it like a static analysis tool. You don't ship code just because the linter passed. You don't trust a binary just because the AI said it's clean.

The timeline question

BinaryAudit shows the best model going from effectively zero capability to 49% detection in roughly two years. If that trajectory holds, we might see 70-80% detection rates within 18 months. The catch is the false positive rate: for production use it matters more than detection, and it's much harder to extrapolate.

But here's what I'd bet on: AI binary analysis will reach "useful-as-triage" quality before it reaches "trust-it-as-primary-defense" quality. The engineering opportunity is in building the pipeline architecture now, so you can swap in better models as they arrive, rather than waiting for a single model to solve the problem end-to-end.

The companies that build that pipeline infrastructure in 2026 will have a significant advantage when the models catch up. The ones waiting for a silver bullet will still be running strings and hoping for the best.
