How agent security benchmarks actually score

There are now enough agent security benchmarks that people quote attack success rates the way they quote uptime, as if the number stands on its own. It does not. Two benchmarks can report wildly different attack success rates on the same model and both be correct, because they are testing different attacks. Before you compare, you have to know what you are comparing. This is a short field guide to the six benchmarks that matter and what each one actually claims.

A word on the metric first. Attack success rate (ASR) is the share of attempted attacks that reach their goal. It only means something once you fix four variables: which harness, which class of attack, whether the attacker was allowed to adapt, and how much legitimate utility the agent kept while defending. Change any one and the number moves. Quote it without them and you are quoting noise.

What is the benchmark everyone else measures against?

AgentDojo is the de-facto standard harness. AgentDojo (NeurIPS 2024) provides 97 agent tasks and 629 security test cases, and it became the reference because it embeds the injection inside a realistic task rather than testing prompts in isolation. When a defense paper reports a number, it is usually an AgentDojo number, which is the only reason cross-paper comparison works at all. That is also its scope: task-embedded indirect injection, on its own tool set. It says nothing about attacks aimed at the tool layer itself.

How exposed is a plain tool-using agent?

More than the marketing suggests, and measurably so. InjecAgent assembles 1,054 indirect prompt injection test cases and found a ReAct GPT-4 agent vulnerable to 24% of them. Read that as a floor, not a ceiling: it is a general-purpose agent against a broad injection set, no adaptive tuning. When people say agents are exposed by default, this is one of the cleaner citations for it.

How high can attack success go?

High enough that the range itself is the point. Agent Security Bench (ASB, ICLR 2025) spans more than 400 tools and reports a peak attack success rate of 84.30%. Set that against InjecAgent's 24% and you have the whole problem in two numbers. Same broad category, injection against tool-using agents, and a 60-point spread, because ASB is measuring peak success across many attack configurations while InjecAgent measures one agent's average vulnerability. Neither is wrong. They are answering different questions.

A benchmark score is a claim about a fixed attack set. Read it as "this defense held against these attacks," never as "this agent is safe."

What changes when the attack targets MCP?

The threat model moves from the message to the tool. The Model Context Protocol lets an agent load tools from external servers, and that opens attacks that AgentDojo never modeled. MCPSecBench sorts them into 17 distinct attack types and reports that existing protections stop under 30% of them on average. MCP Security Bench (MSB) widens the aperture again: 2,000 attack instances across 9 agents and 405 tools. These do not produce a single headline ASR you can paste next to AgentDojo, and that is the honest part. They are mapping a surface, not scoring one attack.

What is tool poisoning and why does it score so high?

It is an attack on the tool description itself, and it scores high because the agent trusts that description implicitly. MCPTox studies tool poisoning across 45 real MCP servers, 353 tools and 1,312 cases, and reports a 72.8% attack success rate on o1-mini. The malicious instruction is not hidden in a document the agent reads; it is baked into the metadata of a tool the agent was told to use. No amount of scanning the user's message catches it, because the poison is upstream of the message. This is the attack class that makes "we filter prompts" sound quaint.

agent benchmarks · not one scalescope

1AgentDojo 97 tasks / 629 cases task-embedded

2InjecAgent 24% ASR (ReAct GPT-4) indirect inj.

3ASB 84.30% peak / 400+ tools

4MCPSecBench 17 types, defenses <30%

5MSB 2,000 inst / 9 agents / 405 tools

6MCPTox 72.8% ASR (o1-mini) tool poisoning

Six harnesses, six scopes. The numbers are not on one axis: some report peak ASR, some average vulnerability, some defense coverage across attack types. Comparing them straight is a category error.

So how should a practitioner read a benchmark score?

Extract four things before you let a number carry any weight:

Which harness. An AgentDojo number and an MCPTox number are not on the same axis. Task-embedded injection and tool poisoning are different attacks with different fixes.
Which attack class. Indirect injection, tool poisoning, and the 17 MCP-specific types in MCPSecBench each stress a different part of the stack. A defense can ace one and ignore the others.
Isolated or adaptive. A low score against a fixed attack set is provisional. Ask what an attacker who tunes to the defense does to it.
Utility retained. An ASR near zero is easy if the agent stops doing useful work. The score only counts alongside how many legitimate tasks still complete.

Notice what all six benchmarks share underneath the differences. Every one measures what happens at the point where untrusted input, or an untrusted tool, becomes a privileged action. That is the join they all probe, and it is the one place a control can sit that does not care which of the six attacks arrives. This is why the durable answer is authorization, not detection: a benchmark score tells you how a defense fared against a fixed attack set, while authorization at the boundary constrains the action regardless of the attack's phrasing or origin. The evidence for that, in other people's numbers, is what the defense literature converges on.

The takeaway

These benchmarks are the best instruments we have for measuring agent exposure, and they are worth reading closely. Just do not read a single attack success rate as a grade. It is a claim about one harness, one attack class, one attacker who may or may not have adapted. Line the six up and they do not rank; they map a surface. The thing that holds across the whole map is the control that governs the action instead of guessing at the input.

Oktsec enforces authorization at the tool-call boundary, deterministically, so your security does not rest on whichever attack class the next benchmark decides to measure. See Control →

How agent security benchmarks actually score.

The short version

What is the benchmark everyone else measures against?

How exposed is a plain tool-using agent?

How high can attack success go?

What changes when the attack targets MCP?

What is tool poisoning and why does it score so high?

So how should a practitioner read a benchmark score?

The takeaway

The short version

What is the benchmark everyone else measures against?

How exposed is a plain tool-using agent?

How high can attack success go?

What changes when the attack targets MCP?

What is tool poisoning and why does it score so high?

So how should a practitioner read a benchmark score?

The takeaway

What the research shows about defending AI agents.

Detection versus authorization.