OpenAI's EVMbench just took a credibility hit after OpenZeppelin said the benchmark is tainted by data contamination, a problem that can make AI models look sharper at finding smart contract bugs than they really are. The catalyst was OpenZeppelin's review of the dataset, which flagged training data leaks and at least four high-severity vulnerability labels that don't hold up. [1]

What EVMbench is trying to measure

EVMbench is a blockchain security benchmark designed to test whether AI systems can detect vulnerabilities in Ethereum Virtual Machine smart contracts. [2] It launched mid-February in partnership with Paradigm, positioning itself as a standard yardstick for "AI auditors" and agentic tooling that claims it can catch reentrancy, access control issues, unsafe external calls, and other classic EVM footguns. [3]

Benchmarks like this matter because the space is moving fast: teams are shipping AI-assisted audit workflows, and plenty of founders want a clean number they can point to on CT. "Model X scored Y on EVMbench" is easy marketing. The hard part is ensuring that Y actually means something.

OpenZeppelin's core claim: contamination undermines the signal

OpenZeppelin, one of the best-known smart contract security shops, says EVMbench's dataset suffers from data contamination, meaning some portion of the benchmark content overlaps with material that may already be in model training corpora, or is otherwise accessible to models in a way that compromises the evaluation. [4]

That overlap is a big deal in practice:

  • If a model has seen the exact contract snippet before, it can "solve" the task by recall rather than reasoning.
  • Even partial overlap (near-duplicates, mirrored repos, repeated patterns) can inflate scores because the model is effectively getting hints.
  • Contamination tends to show up as unnaturally strong performance on specific subsets, and it can make model-to-model comparisons misleading.
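
One rough way to look for that last signal is to compare accuracy across subsets of the benchmark. The sketch below is illustrative only: the subset tags ("publicly_indexed", "unpublished") and the per-sample outcomes are invented, and nothing here reflects how EVMbench or OpenZeppelin actually slice the data.

```python
from collections import defaultdict

def accuracy_by_subset(results):
    """Compute accuracy per subset from (subset_name, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # subset -> [correct, total]
    for subset, is_correct in results:
        totals[subset][0] += int(is_correct)
        totals[subset][1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}

# Hypothetical per-sample outcomes; subset tags are assumptions for illustration.
results = [
    ("publicly_indexed", True), ("publicly_indexed", True),
    ("publicly_indexed", True), ("publicly_indexed", True),
    ("unpublished", True), ("unpublished", False),
    ("unpublished", False), ("unpublished", True),
]

scores = accuracy_by_subset(results)
gap = scores["publicly_indexed"] - scores["unpublished"]
print(scores, "gap:", gap)
```

A large gap does not prove contamination on its own, since the subsets may simply differ in difficulty, but it is a cheap prompt to go look at where the high-scoring samples came from.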

OpenZeppelin also pointed to methodological flaws, and notably said it found at least four invalid high-severity vulnerability classifications in the benchmark. That second point is not just academic. If the label is wrong, then "correct" answers become ambiguous, and participants can be penalized for being accurate.

The net effect: contamination plus questionable labels can turn a benchmark into a vibes-based leaderboard.

The label problem: why "high severity" needs to be precise

Smart contract security is brutally sensitive to context. A pattern that is catastrophic in one contract can be benign in another due to:

  • permissions and ownership boundaries
  • upgradeability constraints
  • assumptions about trusted callers
  • invariants enforced elsewhere in the system

So when OpenZeppelin says at least four "high severity" items are invalid, it's a flashing warning light about ground truth. Benchmarks live and die on clean labeling.

A practical way to think about it: if the dataset says "this is an exploitable bug," but it isn't, then an AI system that calls it out is being rewarded for hallucinating exploits. On the flip side, a model that correctly says "not a vuln" scores worse on that sample, pushing teams toward noisier tooling.

That is how you end up with security products that spam false positives, drain engineering time, and still miss the real issues.
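
The arithmetic is easy to see with a toy example. Everything below is made up for illustration (sample IDs, labels, and the simple match-the-label scoring rule are assumptions, not EVMbench's actual scoring): two of five samples carry a wrong "high severity" label, and the model that is right about reality loses to the one that flags everything.

```python
def score(predictions, labels):
    """Fraction of samples where the prediction matches the benchmark label."""
    matches = sum(predictions[k] == labels[k] for k in labels)
    return matches / len(labels)

# True = "exploitable high-severity bug", False = "not a vulnerability".
# Hypothetical ground truth for five samples.
true_state = {"s1": True, "s2": True, "s3": False, "s4": False, "s5": True}

# Hypothetical benchmark labels: s3 and s4 are wrongly marked as exploitable.
benchmark_labels = {"s1": True, "s2": True, "s3": True, "s4": True, "s5": True}

accurate_model = dict(true_state)             # matches reality on every sample
noisy_model = {k: True for k in true_state}   # flags everything as a bug

accurate_score = score(accurate_model, benchmark_labels)  # 3 of 5 match
noisy_score = score(noisy_model, benchmark_labels)        # 5 of 5 match
print(accurate_score, noisy_score)
```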

Why contamination is especially messy for EVM security tasks

Contamination is a known problem across AI evals, but smart contract benchmarks have a few extra traps:

Public repos are effectively pretraining fuel

A lot of Solidity code is open source, forked endlessly, mirrored across chains, and republished in gists and audits. If EVMbench samples from public code without tight controls, overlap becomes likely. [5]

Vulnerability writeups are widely indexed

When a contract gets exploited, the postmortems and exploit analyses spread everywhere. If EVMbench includes contracts tied to high-profile incidents, the "answer" might already be present in the training data, sometimes explicitly.

Near-duplicate patterns can be enough

Even when the exact file is not present, common templates and repeated anti-patterns can make it difficult to tell whether a model is reasoning or pattern-matching. That is not useless, but it is not the same claim as "the model can audit novel contracts."
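
A minimal sketch of a near-duplicate check, assuming a simple approach: strip comments, tokenize, and compare overlapping token windows (shingles) with Jaccard similarity. The function names, the window size of 5, and the Solidity snippets are all illustrative choices, not anything taken from EVMbench or OpenZeppelin's review. The comment stripping is naive and would also eat `//` inside string literals.

```python
import re

def normalize(source):
    """Drop comments and split into tokens so formatting changes don't matter."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", "", source)                    # line comments
    # Identifiers, numbers, or any single non-whitespace symbol.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def shingles(tokens, k=5):
    """Set of overlapping k-token windows."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(src_a, src_b, k=5):
    return jaccard(shingles(normalize(src_a), k), shingles(normalize(src_b), k))

original = '''
function withdraw(uint256 amount) external {
    // send funds before updating state
    (bool ok, ) = msg.sender.call{value: amount}("");
    require(ok);
    balances[msg.sender] -= amount;
}
'''

# Same code with different whitespace and comments.
reformatted = '''
function withdraw(uint256 amount) external
{
  (bool ok, ) = msg.sender.call{ value: amount }("");  /* payout */
  require(ok);
  balances[msg.sender] -= amount;
}
'''

unrelated = "function totalSupply() public view returns (uint256) { return supply; }"

sim_copy = similarity(original, reformatted)
sim_unrelated = similarity(original, unrelated)
print(sim_copy, sim_unrelated)
```

A check like this catches reformatted copies but not renamed variables or restructured logic, which is exactly why near-duplicates are hard to rule out.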

Market structure angle: why this matters beyond academic benchmark drama

EVMbench landed in a moment where AI x crypto tooling is fighting for mindshare and budget. Benchmarks are the ammo used to justify:

  • procurement decisions by protocols
  • VC narratives around "AI auditors"
  • dev teams trusting copilots for security-critical code
  • new agent frameworks that promise continuous monitoring

If OpenZeppelin's contamination critique sticks, it changes how much confidence teams should place in EVMbench-driven rankings. Projects with big bags in "AI security" will still market their scores, but buyers will have to ask tougher questions about what those scores represent.

It also impacts how quickly teams should delegate to AI. Plenty of devs already use LLMs to draft Solidity. Handing the same class of model a security sign-off based on a benchmark with leaks is a recipe for misplaced confidence.

What a cleaner benchmark would likely require

OpenZeppelin's findings put pressure on benchmark creators to adopt the same kind of rigor expected in security work. In practice, that usually means:

  • Provenance tracking: clear sourcing for each sample, plus evidence it is not trivially indexed or duplicated.
  • Deduplication and similarity checks: not just exact matches, but near-duplicate detection across repos and datasets.
  • Reproducible splits: strict separation between train-like material and evaluation material, plus documentation.
  • Label adjudication: multiple reviewers, clear severity rubric, and public dispute resolution when labels are challenged.
  • Versioning: benchmarks evolve, so scores should be tied to dataset versions, not a floating target.

None of this is fun, but it is what turns an eval into something the industry can trust.
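
The versioning point in particular is cheap to implement. One possible sketch, assuming a dataset is just a mapping of sample IDs to contents: hash every sample in a stable order and store the resulting fingerprint next to each reported score. The field names and sample data below are hypothetical.

```python
import hashlib

def dataset_fingerprint(samples):
    """Hash sample IDs and contents in sorted order so ordering doesn't matter."""
    digest = hashlib.sha256()
    for sample_id in sorted(samples):
        content_hash = hashlib.sha256(samples[sample_id].encode("utf-8")).hexdigest()
        digest.update(f"{sample_id}:{content_hash}\n".encode("utf-8"))
    return digest.hexdigest()

# Hypothetical dataset revisions.
v1 = {"s1": "contract A {}", "s2": "contract B {}"}
v1_reordered = {"s2": "contract B {}", "s1": "contract A {}"}
v2 = {"s1": "contract A {} // patched sample", "s2": "contract B {}"}

fp_v1 = dataset_fingerprint(v1)
fp_v1_reordered = dataset_fingerprint(v1_reordered)
fp_v2 = dataset_fingerprint(v2)

# A score record that names the exact dataset revision it was measured on.
score_record = {"model": "model-x", "score": 0.8, "dataset_fingerprint": fp_v1}
print(score_record)
```

With a fingerprint attached, a score measured before a dataset patch cannot be quietly compared against one measured after it.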

What to watch next

A few concrete signals will matter over the next few weeks:

  1. Will EVMbench maintainers publish a response detailing what was contaminated and how it will be fixed?
  2. Will the dataset be patched and versioned so past scores are clearly separated from new scores?
  3. Will third parties replicate OpenZeppelin's findings, especially on the mislabeled high-severity cases?
  4. Will vendors keep advertising old scores without clarifying the dataset revision?

If fixes land quickly with transparent methodology, EVMbench can still become useful. If the response is hand-wavy, expect the benchmark to become another leaderboard that mainly measures who trained closest to the test set.

Takeaway: treat EVMbench scores as "soft data" until provenance is nailed down

OpenZeppelin's contamination warning is a reminder that AI security evals are easy to game accidentally, even without bad intent. For builders and protocols, the risk is straightforward: over-trusting benchmark results can lead to shipping contracts with a false sense of safety. [6]

Until EVMbench's dataset is demonstrably cleaned and the labeling disputes are resolved, teams should treat high scores as directional at best, then validate with old-school controls: human review, formal methods where feasible, differential fuzzing, and adversarial testing. The thesis that "AI auditors are getting good fast" only holds if the eval is clean, and OpenZeppelin just challenged that assumption with specifics, including training data leaks and multiple high-severity labels that appear wrong.