Can Frontier LLMs Audit Smart Contracts?
An internal evaluation: 6 frontier models, 14 real audit-fix commits from our own protocol, 40 rubric findings, honest numbers.
Why this exists
I lead security at EigenLabs. One of the questions I keep getting from other security engineers, tooling vendors, and protocol teams thinking about their own pipelines is some variant of: can a frontier LLM actually audit a smart contract? If so, which one is best? And what happens if I just wire one into my review process?
The honest answer has been “I don’t know; the public benchmarks don’t look much like real audit work.” So I built the evaluation I wish existed and ran it against our own repo.
The setup: I took 14 merged audit-fix commits from Layr-Labs/eigenlayer-contracts, checked out the vulnerable parent of each, and asked six frontier LLMs to audit them blind. Then I graded every produced report against the per-commit rubric I derived from the actual fix. These are bugs we already found, patched, and shipped, so I know exactly what a good auditor should flag.
The short version: the best model catches a bit more than half the findings, the worst catches about a third, and every single model reports roughly 3 to 9 “findings” per task that don’t correspond to anything in the rubric. The rest of this post is about what worked, what didn’t, and what I’m building next.
The dataset
I picked 14 merged audit-fix commits from eigenlayer-contracts that satisfied three properties:
- The fix addresses one or more security-relevant bugs (informational through critical).
- The bugs were flagged during a real audit, either by an external firm we work with (primarily Certora) or by our own security engineers, with a commit message that pointed clearly at the finding.
- The vulnerable parent state compiles and runs against the same foundry.toml.
Picking from our own repo has an obvious advantage: I know what the bugs actually were, I wrote or reviewed the fixes, and I can grade the LLM’s output against the same standard a real EigenLayer audit would apply. The obvious disadvantage is that I’m grading tools against a protocol I know well, so the numbers here say something about EigenLayer-style code, not about smart contracts in general. I’ll flag where that matters.
Each task is a directory containing:
- task.toml: metadata (repo, vulnerable_sha, fix_sha, feature name, timeouts, resource limits)
- instruction.md: the prompt the agent sees
- Dockerfile: reproducible environment (Foundry, rg, grep, Python) with the vulnerable source mounted and .git stripped so the agent cannot cheat by reading prior commits
- tests/: the harness verifier config that checks “did the agent produce REPORT.md”
The 14 slugs:
allocation-delay-on-register, bn254cv-nonsigner-ordering, duration-vaults, merkle-leaf-salt, merkle-library, protocol-registry, redistribution-burn-check, reentrancy-checks, release-manager, rewards-v2, slash-escrow-oog, slasher-instant-effect, slashing-commitments, task-mailbox.
For every task I hand-wrote a rubric enumerating each distinct bug the fix addresses, the rubric criteria (“agent must name the function and describe a compatible exploit scenario”), and which findings are “do not count” (e.g. NatSpec-only items where missing them is not a capability signal).
Total rubric items across 14 tasks: 40. Two are explicitly flagged “do not count” in the rubric itself (merkle-library I-04, duration-vaults I-05), leaving 38 effective ground-truth items.
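To make the denominator explicit, here is a minimal Python sketch of how a rubric and its “do not count” flag could be represented. The `RubricItem` class and `effective_items` helper are hypothetical names for illustration, not the harness’s actual schema, and the list is a tiny subset of the real 40 items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    task: str            # task slug, e.g. "merkle-library"
    finding_id: str      # audit finding id, e.g. "I-04"
    counts: bool = True  # False for items flagged "do not count"


def effective_items(rubric):
    """Return only the items that contribute to the hit-rate denominator."""
    return [item for item in rubric if item.counts]


# Illustrative subset only; the real rubric has 40 items, 38 of them effective.
rubric = [
    RubricItem("merkle-library", "I-04", counts=False),
    RubricItem("duration-vaults", "I-05", counts=False),
    RubricItem("task-mailbox", "H-03"),
    RubricItem("duration-vaults", "M-03"),
]
effective = effective_items(rubric)
print(len(rubric), "items,", len(effective), "effective")
```

Applied to the full rubric, the same filter would turn 40 items into the 38 used everywhere below.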
The harness
Containerized isolation
Every trial runs in a fresh Docker container. The container has the vulnerable source tree but no git history, so the agent must reason from the source itself, not git log. It has read-only Foundry tooling and ripgrep so it can search and compile if it wants, and a single writable file: REPORT.md.
I built on top of two existing frameworks:
- Harbor (from the Terminal-Bench authors) handles the per-trial Docker lifecycle, the verifier step, and job-level orchestration. Harbor runs one container per (task × model × trial) cell.
- Pi (@mariozechner/pi-coding-agent) is a protocol-flexible coding agent with adapters for OpenAI-compatible providers, Anthropic direct, and Bedrock. Pi emits a JSONL event stream per trial.
A thin custom Harbor adapter lets Harbor invoke Pi with arbitrary provider/model combinations and injects provider config: prices, context windows, and max_tokens ceilings. That last one bit me badly; more on it later.
The prompt
The same prompt template runs for every model:
You are a senior software security auditor. Scope: the {{FEATURE_NAME}} feature and the surrounding code it interacts with. Treat {{FEATURE_NAME}} as a logical subsystem that may span multiple files, interfaces, and tests. Your first job is to locate the relevant code.
The prompt is deliberately directed but unscoped: the agent gets the feature’s natural-language name but no file paths, no severity hints, no suggestions about what the bugs might be, and no count of how many findings to look for. The agent decides what’s a bug. I explicitly instruct the model to not invent findings, and to say “I reviewed X and believe it is correct” as an acceptable answer if that’s what it concludes.
This matches real-world conditions: auditors get a scope description, not a bug list.
The models
Six frontier LLMs, all at their highest reasoning setting:
| Model | Provider | Context | Output cap | Concurrency |
|---|---|---|---|---|
| Claude Opus 4.7 (extended thinking, xhigh) | Anthropic direct | 1M | 128k | 4 |
| GPT-5.5 | OpenAI direct | 400k | 128k | 14 |
| Kimi K2.6 | Fireworks | 262k | 32k | 4 |
| MiniMax M2.7 | Fireworks | 196k | 196k | 4 |
| DeepSeek V4 Pro | Fireworks | 131k | 131k | 3 + 4 |
| GLM 5.1 | Fireworks | 131k | 131k | 4 |
Concurrency is per-job and empirically calibrated: GPT-5.5 on OpenAI tolerated 14 parallel streams cleanly, while Fireworks rate-limited at 14 (on MiniMax’s first attempt, 12 of 14 trials failed with HTTP 429), so I settled at 4 for the Fireworks-hosted models.
Every trial ran once. No pass@k, no retries on capability grounds. Retries only for infrastructure failures (stream terminations, rate limits), and those are flagged as caveats.
Output capture
Every agent emits a multi-GB JSONL event stream (every message_update delta). I compacted these post-hoc with a script that drops token-level deltas but keeps all message_end, tool_execution_*, and turn_* events. Compression ratio: ~1000-12000× depending on the model (MiniMax is verbose, GPT-5.5 is terse). Total trace storage across 6 models × 14 tasks: ~23 MB compacted from ~19 GB raw.
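A compaction pass of this kind might look like the sketch below. It assumes each JSONL line is an object with a `type` field holding the event name; that field name is an assumption, not something confirmed from Pi’s output format, and `compact_lines` is a hypothetical helper.

```python
import json

# Event types to keep, mirroring the names mentioned above.
KEEP_EXACT = {"message_end"}
KEEP_PREFIXES = ("tool_execution_", "turn_")


def keep_event(event_type):
    """True for events that survive compaction (everything but deltas)."""
    return event_type in KEEP_EXACT or event_type.startswith(KEEP_PREFIXES)


def compact_lines(lines):
    """Yield only the JSONL lines whose event type should be kept."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if isinstance(event, dict) and keep_event(str(event.get("type", ""))):
            yield line


raw = [
    '{"type": "message_update", "delta": "Hel"}',
    '{"type": "message_update", "delta": "lo"}',
    '{"type": "message_end", "text": "Hello"}',
    '{"type": "tool_execution_start", "tool": "rg"}',
    '{"type": "turn_end"}',
]
compacted = list(compact_lines(raw))
print(len(raw), "->", len(compacted), "events")
```

Because the function streams line by line, it can run over a multi-GB trace without loading it into memory.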
Grading
I graded manually, using Claude Opus 4.7 as the reader, scoring each agent’s REPORT.md against the rubric one finding at a time. Three possible calls per rubric item:
- Hit: agent identifies the same underlying bug. Paraphrased wording is fine; naming the same file+function and describing a compatible exploit/failure mode counts.
- Partial: adjacent but not exact (e.g. describes the right bug but locates the fix in the wrong function; flags the symptom without the mechanism).
- Miss: no correspondence.
A fourth state, no report, tracks trials that didn’t produce a REPORT.md.
Partials were the hardest grading call, so I later re-audited every single one against the rubric’s explicit hit-criteria. Four originally-partial assignments turned out to be different bugs entirely and were demoted to miss; two were arguably clean hits I downgraded to partial for consistency with other models. Transparent per-finding reasoning is in the partial-credit ledger in the full artifacts.
Self-grading bias: Claude Opus 4.7 read Opus 4.7’s own reports. I flagged this in the caveats and re-graded Opus strictly against rubric wording, which resulted in several downgrades from hit to partial. I did not re-grade the other five models as strictly, so a symmetric re-audit could shift several numbers by ±1-2. This is the largest methodological caveat in the benchmark.
Headline results
Across the 38 effective rubric items, counting hits only:
| Rank | Model | Hits / 38 | Hit-rate | Partials | Zero-hit tasks |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 20 | 53% | 4 | 1 |
| 2 | GPT-5.5 | 18 | 47% | 2 | 3 |
| 3 | Kimi K2.6 | 16 | 42% | 2 | 4 |
| 4 | MiniMax M2.7 | 15 | 39% | 2 | 4 |
| 5 | GLM 5.1 | 14 | 37% | 3 | 6 |
| 6 | DeepSeek V4 Pro | 13 | 34% | 3 | 5 |
One universally-missed finding: task-mailbox H-03, the only High-severity item in the benchmark (MAX_TASK_SLA vs deallocation delay). Zero models caught it. Union coverage across all six models on hits only: 23/38 = 61%.
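The percentages above are plain hits-over-38 arithmetic. The sketch below recomputes them from the table and shows how union coverage could be computed as a set union over per-model hit sets. The per-item hit sets are not reproduced in this post, so the `toy` data uses made-up item ids purely to illustrate the operation.

```python
EFFECTIVE_ITEMS = 38

# Hit counts copied from the headline table.
hits = {
    "Claude Opus 4.7": 20,
    "GPT-5.5": 18,
    "Kimi K2.6": 16,
    "MiniMax M2.7": 15,
    "GLM 5.1": 14,
    "DeepSeek V4 Pro": 13,
}


def hit_rate(n_hits, total=EFFECTIVE_ITEMS):
    """Hit-rate as a whole-number percentage."""
    return round(100 * n_hits / total)


def union_coverage(hits_by_model):
    """Set of rubric items hit by at least one model."""
    return set().union(*hits_by_model.values())


rates = {model: hit_rate(n) for model, n in hits.items()}
union_rate = hit_rate(23)  # 23 distinct items hit by at least one model

# Hypothetical item ids, only to show the set operation.
toy = {"model-a": {"x-1", "x-2"}, "model-b": {"x-2", "x-3"}}
toy_union = union_coverage(toy)
```

Union coverage is the more useful number for an ensemble setup: it is the ceiling you would reach by running all six models and merging their reports.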
Severity breakdown
Categorizing each rubric item by severity (High / Medium / Low / Informational / Other, where Other covers reentrancy, NatSpec-tagged, or bug-prefix items with no explicit severity):
| Severity | Items | DeepSeek | GLM | Kimi | MiniMax | GPT-5.5 | Opus 4.7 |
|---|---|---|---|---|---|---|---|
| High | 1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| Medium | 4 | 1/4 | 1/4 | 2/4 | 2/4 | 3/4 | 3/4 |
| Low | 7 | 2/7 | 3/7 | 3/7 | 2/7 | 2/7 | 5/7 |
| Informational | 20 | 7/20 | 7/20 | 8/20 | 8/20 | 10/20 | 9/20 |
| Other | 6 | 3/6 | 3/6 | 3/6 | 3/6 | 3/6 | 3/6 |
The severity decomposition is the most interesting part of the numbers:
- The single High went 0/6. Nobody found task-mailbox H-03. This is a real “frontier models can’t find highs yet” signal, though n=1 is too small to generalize.
- Medium is a cleanly tied top tier. GPT-5.5 and Opus both at 3/4, missing only the same one (duration-vaults M-03, the updateDelegationApprover rotation path). DeepSeek and GLM are stuck at 1/4.
- Low is where Opus pulls ahead. 5/7 on Low is a structural bug-spotting lead, not a volume win.
- Informational is where GPT-5.5 leads outright. 10/20 versus Opus’s 9/20. This is the band where surface pattern recognition (missing events, wrong error names, NatSpec gaps) dominates over deep reasoning.
- “Opus leads overall” decomposes into “Opus leads on Low, GPT-5.5 leads on Info, tied on Medium.” A much more useful insight than the single-number ranking.
False positives: the loudest failure mode
For every model I counted ^## headings in each REPORT.md (filtering out boilerplate like “Scope” and “Summary”), then subtracted the findings that matched a rubric item. The residual is “unmatched findings”:
| Model | Reported | Matched | FP | FP rate | Precision | FPs / task |
|---|---|---|---|---|---|---|
| GPT-5.5 | 54 | 16 | 38 | 70% | 30% | 2.7 |
| Kimi K2.6 | 74 | 17 | 57 | 77% | 23% | 4.1 |
| DeepSeek V4 Pro | 81 | 17 | 64 | 79% | 21% | 4.6 |
| GLM 5.1 | 84 | 18 | 66 | 79% | 21% | 4.7 |
| MiniMax M2.7 | 89 | 16 | 73 | 82% | 18% | 5.2 |
| Opus 4.7 | 147 | 24 | 123 | 84% | 16% | 8.8 |
Two nuances:
- “FP” here means “unmatched against this rubric.” Some unmatched findings may be real bugs the original audit didn’t flag. The rubric is derived from the audit-fix commit, so by construction it doesn’t contain bugs the auditors missed. I did not independently research every unmatched finding; this is a lower bound on true-precision, not an upper bound.
- Heading counting is heuristic. Opus tends to nest sub-headings under ##; other models use ### Finding N. Raw counts are stable run-to-run but cross-model comparable to ±1-2. Precision rankings don’t change under reasonable counting-rule shifts.
The headline tension is clear: Opus has the highest recall but the worst precision, reporting ~3× as many findings as GPT-5.5 per task. Opus gives you more real bugs and more noise. GPT-5.5 sits on the Pareto frontier for any use case where reviewer time is the bottleneck.
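The counting heuristic and the derived columns can be sketched as follows. This is an approximation under stated assumptions: the pattern treats any heading of level two or deeper as a candidate finding (which is what a plain `^##` prefix match does), and the boilerplate set contains only the two titles named above, whereas the real filter list may be longer.

```python
import re

# Titles treated as boilerplate; only these two are named in the text.
BOILERPLATE = {"scope", "summary"}

# Any markdown heading of level 2 or deeper, like a "^##" prefix match.
HEADING = re.compile(r"^#{2,}[ \t]+(.+)$", re.MULTILINE)


def count_findings(report_md):
    """Count candidate findings: headings minus boilerplate sections."""
    titles = [t.strip().lower() for t in HEADING.findall(report_md)]
    return sum(1 for t in titles if t not in BOILERPLATE)


def precision(reported, matched):
    return matched / reported


def fp_per_task(reported, matched, n_tasks=14):
    return (reported - matched) / n_tasks


sample = """# Audit report
## Scope
## Finding 1: missing event
### Finding 2: unchecked return value
## Summary
"""
n_findings = count_findings(sample)
```

Plugging the table’s numbers in, `precision(54, 16)` gives GPT-5.5’s 30% and `fp_per_task(147, 24)` gives Opus’s 8.8 unmatched findings per task.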
Economics
| Model | Trials | Wall time | Input tokens | Output tokens | Cost recorded |
|---|---|---|---|---|---|
| GPT-5.5 | 14 | 9m 33s @ n=14 | 12.9M | 193k | ~$29.63 (est.) |
| MiniMax M2.7 | 14 | 16m 10s @ n=4 | 15.5M | 189k | $0 (Fireworks unpriced) |
| DeepSeek V4 Pro | 14 | 33m 28s @ n=3-4 | 12.0M | 322k | $7.73 |
| Kimi K2.6 | 14 | 46m 49s @ n=4 | 37.3M | 664k | $0 (Fireworks) |
| Opus 4.7 | 14 | 51m 25s @ n=4 | 29.5M | 584k | $39.63 |
| GLM 5.1 | 14 | 1h 23m @ n=4 | 24.2M | 454k | $0 (Fireworks) |
Cost numbers are unreliable for four of six models. The adapter didn’t hardcode Fireworks per-model pricing, so three of the four Fireworks-hosted models (MiniMax, Kimi, GLM) report $0; real billing is on the Fireworks dashboard. GPT-5.5’s figure is hand-computed at assumed rates ($5/M non-cached input, $1.25/M cached, $40/M output); OpenAI hasn’t published GPT-5.5 prices yet, so treat it as ±25%. Only DeepSeek’s $7.73 and Opus’s $39.63 are adapter-recorded against the provider’s published rates.
What I can say with confidence: on the two models where I have both recorded cost and hit-rate, cost-per-hit is $0.59 for DeepSeek V4 Pro and $1.98 for Opus 4.7. Opus’s lead on hits comes at roughly 3.3× the cost per hit (and about 5.1× the total spend). Whether it’s worth it depends on how expensive your downstream human reviewer is.
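For readers who want to redo the arithmetic, a small sketch using only figures from the tables above. The `estimate_cost` helper applies the assumed GPT-5.5 rates; the cached versus non-cached input split is not given in this post, so only the output-token component is evaluated here.

```python
# (recorded cost in USD, hits) for the two models with adapter-recorded cost.
recorded = {
    "DeepSeek V4 Pro": (7.73, 13),
    "Claude Opus 4.7": (39.63, 20),
}


def cost_per_hit(cost_usd, n_hits):
    return cost_usd / n_hits


def estimate_cost(non_cached_m, cached_m, output_m,
                  rate_in=5.00, rate_cached=1.25, rate_out=40.00):
    """Cost in USD from token counts in millions, at assumed per-million rates."""
    return non_cached_m * rate_in + cached_m * rate_cached + output_m * rate_out


cph = {model: cost_per_hit(c, h) for model, (c, h) in recorded.items()}
per_hit_ratio = cph["Claude Opus 4.7"] / cph["DeepSeek V4 Pro"]
total_ratio = recorded["Claude Opus 4.7"][0] / recorded["DeepSeek V4 Pro"][0]

# Output-token share of the GPT-5.5 estimate: 193k tokens at $40/M.
gpt_output_cost = estimate_cost(0.0, 0.0, 0.193)
```

Computed from unrounded values, the per-hit ratio comes out at about 3.3 and the total-spend ratio at about 5.1.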
The long tail of what went wrong
The headline numbers elide a lot of operational pain. A few issues worth documenting because they’ll bite anyone trying to reproduce this setup:
GLM’s maxTokens: 25344
The original GLM 5.1 sweep failed 3/14 trials with no REPORT.md produced. Root cause: the Fireworks provider config copied max_tokens from a sample payload (25,344) instead of using the model’s native ceiling (131,072). Three tasks wrote output long enough to hit the 25k cap mid-report, at which point the agent terminated without a final write. The model was capable; the harness was silently truncating it.
Caught by looking at n_output_tokens in the failing trials’ result.json. They clustered tightly at 26,246 / 26,382 / 26,886, just above the 25,344 cap. One-line fix; re-running with maxTokens: 131072 recovered all three trials. GLM went from 11/38 (29%) to 14/38 (37%), overtaking DeepSeek.
Lesson: when a new provider adapter ships, check the defaults in the reference payloads; they’re often suspiciously low.
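A cheap guard against this class of silent truncation is to flag any trial whose output-token count lands suspiciously close to the configured cap. The sketch below is a hypothetical check, not part of Harbor or Pi; the 10% slack is an arbitrary illustrative threshold.

```python
def looks_truncated(n_output_tokens, max_tokens, slack=0.10):
    """Flag trials whose output-token count sits within `slack` of the cap."""
    return abs(n_output_tokens - max_tokens) <= slack * max_tokens


# Output-token counts from the three failing GLM trials.
failing = [26_246, 26_382, 26_886]

flags_at_old_cap = [looks_truncated(n, 25_344) for n in failing]
flags_at_new_cap = [looks_truncated(n, 131_072) for n in failing]
```

Against the misconfigured 25,344 cap all three trials are flagged; against the native 131,072 ceiling none are, which is the signal that the cap rather than the model ended the run.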
MiniMax and Fireworks rate limits
MiniMax at n_concurrent=14 tripped Fireworks’ per-account rate limits on the first attempt. 12 of 14 trials failed with 429s in the first 3.5 minutes. Retried at n=4 (16m 10s, clean). The leaderboard counts the retry.
Bedrock’s empty-stop bug for Opus 4.7
Opus 4.7 initially ran via Bedrock because Anthropic direct was billing-constrained. Bedrock’s adapter in Pi emits a default stopReason: "stop" with empty content when Bedrock’s stream closes without any events, and the agent-loop treats that as a clean termination and exits. Across three attempts: 3/14, 1/14, 0/14 reports. Diagnosed down to a specific line in pi-ai/dist/providers/amazon-bedrock.js (a default stopReason initialization that’s never overwritten when the stream has zero events). Switching to Anthropic direct resolved it, and Opus 4.7 produced 14/14 reports on the first try.
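Independent of where the provider bug gets fixed, a harness can refuse to treat an empty stop as success. The following is a language-agnostic sketch written in Python; it is not Pi’s code, and `is_clean_termination` and its parameters are hypothetical names.

```python
def is_clean_termination(stop_reason, content, n_stream_events):
    """Accept a 'stop' only if the stream actually delivered something."""
    if stop_reason != "stop":
        return False
    if n_stream_events == 0:
        return False  # stream closed before any event: treat as a failure
    return bool(content and content.strip())


empty_stop = is_clean_termination("stop", "", 0)
normal_stop = is_clean_termination("stop", "Report written.", 42)
```

With a check like this, a zero-event stream would surface as an infrastructure failure eligible for retry instead of a trial that silently produced no report.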
Fireworks stream terminated
Two trials in GLM’s re-run died with errorMessage: "terminated" from Fireworks mid-response. Infrastructure flake, not a capability or config issue. Both recovered on per-task retries. The kind of noise you have to budget for when running at n=4 across multiple providers.
Self-grading bias
Claude Opus 4.7 graded Opus 4.7’s own reports. A strict re-grade against the rubric’s exact hit-criteria caught several findings that were rubric-adjacent but not quite on-target (notably rewards-v2 B-02, where Opus described a related activatedAt == 0 branch of the same helper instead of the oldSplitBips snapshot branch the rubric demands). Those were downgraded to partial or miss. The other five models were not re-graded to the same strictness, so a symmetric re-audit could shift several numbers by ±1-2. Hard to fix without a truly independent grader.
What I’m building next
The leaderboard is useful as a baseline, but the FP problem is the headline that matters for any downstream use. Opus’s 123 unmatched findings per sweep isn’t usable output: someone has to read each one and decide whether it’s a real bug the audit missed or a plausible-sounding hallucination. The signal-to-noise ratio puts the tool in “interesting research artifact” territory, not “ready for production audit workflows.”
I’m prototyping a verification layer that turns prose findings into runnable proofs. The short version:
- Per-task, spin up a sandbox anvil container on a private Docker network.
- Fork mainnet at a block whose timestamp is close to the vulnerable_sha’s commit date, so the chain state matches the era the audit was working in.
- Overlay the vulnerable commit’s bytecode onto the mainnet proxies via build-time anvil_setCode (blocked at trial time; the whole anvil_* namespace is denied from the agent through an RPC-method allowlist proxy).
- Snapshot the world and bake it into the verifier image.
- After an agent produces its prose report, a frontier exploit-generator (e.g. Opus 4.7 or GPT-5.5) gets the report, the source tree, the deployed ABIs, and RPC access to the verifier. It tries to produce a working Solidity PoC that reproduces each claimed bug on the forked chain, through RPC only. No cheatcodes, no storage mutation, no account impersonation. Just eth_sendRawTransaction and friends.
- The judge runs the PoC and emits proven/unproven per finding.
The key design choice is splitting the bug-finding skill axis from the exploit-writing skill axis. The auditor model (whatever is being benchmarked) writes prose. The exploit-generator (always a frontier model) writes Solidity. This way, weaker models aren’t double-penalized for not-yet-having-mastered-Solidity on top of not-yet-having-mastered-auditing.
Scoring under the new regime becomes two-axis (proven or unproven on-chain, rubric match or not), which gives four outcomes:
- Proven on-chain + matches rubric → clean TP.
- Proven on-chain + doesn’t match rubric → “unlabeled TP”, a real bug the original audit didn’t catch. Flag for human review.
- Unproven + matches rubric → still counts. Many rubric items are NatSpec/docs/events and will never have a PoC.
- Unproven + no rubric match → FP.
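The four outcomes above reduce to a two-boolean decision table. A minimal sketch, with hypothetical label strings:

```python
def classify(proven_on_chain, matches_rubric):
    """Map the (proven, rubric-match) pair to a scoring outcome."""
    if proven_on_chain and matches_rubric:
        return "clean TP"
    if proven_on_chain:
        return "unlabeled TP"   # real bug the audit missed; flag for human review
    if matches_rubric:
        return "TP (unproven)"  # e.g. NatSpec/docs/events items with no PoC
    return "FP"


outcomes = {
    (proven, matched): classify(proven, matched)
    for proven in (True, False)
    for matched in (True, False)
}
```

Only the last branch counts against precision, which is why the verification layer matters most for findings that fall outside the rubric.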
The cheatcode-blocking RPC proxy is the entire security boundary. Without it, an agent could fabricate preconditions (anvil_setStorageAt), impersonate admins, mint balances, or rewind-loop until an exploit appears. The allowlist is small enough to audit by eye (eth_call, eth_sendRawTransaction, eth_getTransactionReceipt, a handful of inspection RPCs, and rate-limited evm_mine). Everything else denied.
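The core of such a proxy is a default-deny method gate. The sketch below is an illustration under assumptions, not the prototype’s code: it lists only the three methods named explicitly above (the “handful of inspection RPCs” is left out because it is not enumerated here), the one-second `evm_mine` interval is an arbitrary placeholder, and `MethodGate` and `filter_request` are hypothetical names. The error code -32601 is JSON-RPC 2.0’s standard “Method not found”.

```python
import time

ALLOWED = {"eth_call", "eth_sendRawTransaction", "eth_getTransactionReceipt"}
RATE_LIMITED = {"evm_mine"}


class MethodGate:
    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self._min_interval = min_interval_s
        self._clock = clock
        self._last_call = {}

    def allow(self, method):
        if method in ALLOWED:
            return True
        if method in RATE_LIMITED:
            now = self._clock()
            last = self._last_call.get(method)
            if last is not None and now - last < self._min_interval:
                return False
            self._last_call[method] = now
            return True
        return False  # default deny: anvil_*, debug_*, and everything else


def filter_request(gate, request):
    """Return None to forward the request, or a JSON-RPC error to send back."""
    if gate.allow(str(request.get("method", ""))):
        return None
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        "error": {"code": -32601, "message": "Method not found"},
    }


gate = MethodGate()
denied = filter_request(gate, {"jsonrpc": "2.0", "id": 1, "method": "anvil_setStorageAt"})
forwarded = filter_request(gate, {"jsonrpc": "2.0", "id": 2, "method": "eth_call"})
```

Keeping the gate default-deny means a newly added node method is blocked until someone deliberately adds it to the allowlist.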
V0 is scoped to one task: slash-escrow-oog, because the O-01 gas-DoS bug is cleanly exploitable on-chain. It ships with two hand-written fixtures (valid/ and invalid/) to validate the judge before pointing LLMs at it.
If v0 shows meaningful precision improvement on the exploitable subset (targeting something like 16% → 70% precision on that track), I’ll build out the remaining 13 tasks. If it doesn’t, I’ll pivot to a lighter-weight citation-verifier that just machine-checks that every file path and function name an agent cites in REPORT.md actually exists in the source tree. Won’t catch logic hallucinations but it’ll catch “this function doesn’t exist in the commit the agent reviewed,” which is a surprisingly common failure mode on the weaker models.
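The fallback citation-verifier could start as simply as the sketch below. It rests on assumptions that would need tuning against real reports: file citations are recognized by a `.sol` suffix, function citations only when written as `name()` in backticks, and a function “exists” if any Solidity file in the tree declares it. `verify_citations` is a hypothetical helper.

```python
import re
from pathlib import Path

PATH_RE = re.compile(r"[\w./-]+\.sol\b")
FUNC_RE = re.compile(r"`(\w+)\(\)`")  # assumes functions are cited as `name()`
DECL_RE = re.compile(r"\bfunction\s+(\w+)")


def verify_citations(report_md, source_root):
    """Return (missing_paths, missing_functions) cited in the report."""
    root = Path(source_root)
    sol_files = list(root.rglob("*.sol"))
    known_paths = {p.relative_to(root).as_posix() for p in sol_files}
    known_names = {p.name for p in sol_files}
    declared = set()
    for p in sol_files:
        declared.update(DECL_RE.findall(p.read_text(errors="ignore")))

    missing_paths = sorted(
        cited for cited in set(PATH_RE.findall(report_md))
        if cited not in known_paths and Path(cited).name not in known_names
    )
    missing_funcs = sorted(set(FUNC_RE.findall(report_md)) - declared)
    return missing_paths, missing_funcs
```

Called as `verify_citations(report_text, "/path/to/source")`, it returns two sorted lists; any non-empty result marks a finding that cites code absent from the reviewed commit.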
What this experiment is and isn’t
It is:
- A reproducible, public-dataset benchmark on 14 real-world audit-fix commits covering 38 graded findings.
- An honest head-to-head of 6 frontier LLMs at their highest reasoning settings.
- A decomposition of “which model is best” into “best at what severity of bug” and “at what precision cost.”
- Evidence that the FP problem, not the recall problem, is the bottleneck for LLM-driven auditing today.
It isn’t:
- Variance-estimated. Every cell is a single trial. Pass@k is unknown. A re-run could shift individual cells by a few points.
- Perfectly graded. Self-grading bias is present on Opus, partial-vs-hit calls are subjective, and I re-graded strictly only in one direction.
- A real audit. Real audits have scoping conversations, severity negotiations, multi-round back-and-forth with the protocol team, and a fix-verification pass. This measures a single-shot “here’s a scope, go find bugs” task, which is a necessary but insufficient condition for “can this model replace an auditor.”
- A generalized result. All 14 tasks are from one protocol. I don’t know whether these numbers transfer to Aave, Uniswap, LayerZero, or general-purpose DeFi.
Acknowledgements
The dataset is built on the security reviews I led on these commits, both my own work and the reviews we ran with external partners (Certora especially). The rubrics are derived directly from the findings we triaged, accepted, and patched on each commit. This post measures how well today’s LLMs rediscover the bugs that human review already found on the same code. The baseline isn’t hypothetical; it’s the standard we actually shipped to.
The gap between frontier models and that standard is real, the gap is narrowing, and the specific failure modes are useful to characterize honestly.
The most valuable next step for the field isn’t another model on this benchmark. It’s other protocol teams running the same experiment on their own repos, so we can tell whether “Opus leads on Low-severity logic bugs, GPT-5.5 leads on Informational pattern-matching” is a stable property of these models or just an artifact of how EigenLayer-shaped code distributes bugs. If you’re a protocol security lead and want to do this on your own commits, I’m happy to share the harness.
I’ll post again when the verification layer is running.