Oktsec Labs: measuring a defense's ASR delta with AgentDojo.

A vendor tells you their defense stops prompt injection. Fine. Here is a lab you can run yourself to put a number on it: run AgentDojo with the defense off, run it again with the defense on, and report your own attack success rate delta plus the utility you had to give up.

The short version

  • This lab measures a defense's ASR delta: attack success rate with the defense off minus attack success rate with it on, alongside the utility the agent still gets done.
  • What you need: Python, an API key for whatever model you want to drive, and about thirty minutes.
  • What you will see: ASR drops when you turn a defense on, and utility usually drops too, so report both numbers rather than the flattering one.
  • The takeaway: measure the delta and the cost together, and treat any single ASR score as provisional until you know whether the attack adapted to the defense.

Everyone selling agent security has a defense, and every defense comes with a claim. It stops prompt injection. It blocks tool abuse. The claims are cheap because the buyer rarely has a way to check them. AgentDojo, the agent security benchmark from ETH Zurich's SPY Lab, gives you that way. It is open source, MIT licensed, and pip installable, and it ships with the flags you need to turn attacks and defenses on and off. So this Oktsec Labs piece is a methodology you run yourself and report your own numbers on.

One note before we start. This is a methodology, not a benchmark of our own. Where I cite numbers, they come from published research, and I will say so. Everything else is the shape of the result you should expect when you run the steps, not a proprietary measurement we are handing you.

What is ASR, and why the delta?

ASR is attack success rate: across a set of attack cases, the fraction where the attacker achieves their goal. On its own, one ASR number tells you almost nothing useful for buying decisions. A defense that reports 5% ASR sounds great until you learn the baseline with no defense was 6%. What matters is the delta, the drop you get by turning the defense on, because that is the protection the defense actually adds.
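
To make the definition concrete, here is a minimal sketch of the arithmetic. The function names and the outcome lists are hypothetical, made up to mirror the 6% versus 5% example above; none of this is AgentDojo's API.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attack cases in which the attacker reached their goal."""
    if not outcomes:
        raise ValueError("need at least one attack case")
    return sum(outcomes) / len(outcomes)


def asr_delta(baseline_asr: float, defended_asr: float) -> float:
    """Drop in ASR from turning the defense on (positive means fewer attacks land)."""
    return baseline_asr - defended_asr


# Hypothetical outcomes for 100 attack cases: True means the attack succeeded.
baseline = [True] * 6 + [False] * 94   # 6% ASR with no defense
defended = [True] * 5 + [False] * 95   # 5% ASR with the defense on

delta = asr_delta(attack_success_rate(baseline), attack_success_rate(defended))
print(f"ASR delta: {delta:.2f}")  # prints "ASR delta: 0.01"
```

A 5% ASR that came from a 6% baseline is a delta of one point, which is why the headline number alone says so little.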

There is a second number that matters just as much, and it is the one vendors leave off the slide. A defense that blocks every attack by also refusing to let the agent do its job is not a defense, it is an off switch. So you measure two things together: how much ASR fell, and how much of the agent's legitimate work survived. AgentDojo reports both, because the paper's whole design pairs 97 real tasks with 629 security test cases (arXiv 2406.13352). The delta and the utility cost are the two numbers that decide whether a defense is worth deploying.

The setup

You are running the same benchmark twice, changing exactly one thing between runs. That is the whole experiment.

  1. Install the benchmark. Run pip install agentdojo. The project is public and MIT licensed at ethz-spylab/agentdojo, and the repo documents the runner and its flags.
  2. Set your model API key. Export the key for whichever model you want to drive the agent. The point of the lab is comparing a defense against a baseline on the same model, so pick one and keep it fixed across both runs.
  3. Pick a task suite. Choose a suite such as workspace so the agent has realistic tasks to do while the attacks fire. Keep the suite the same in both runs so the only variable is the defense.
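  4. Run the baseline. Run the suite with no defense enabled and with an attack selected, then record the resulting ASR and utility. This piece writes the baseline as --defense none; check the runner's --help for the exact spelling, since the baseline may simply mean leaving the defense option off. This is your baseline: how often the built in attacks land, and how much work the agent completes, with nothing protecting it.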
  5. Run with a defense. Run the exact same suite again with a built in --defense, for example tool_filter or one of the transformers based detectors. The flags exist in the repo, so you are not writing anything, just flipping the switch and recording the same two numbers.
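
The steps above come down to one rule: the two runs may differ only in the defense. A rough way to keep yourself honest is to write both runs down as data and check the difference before you compare results. This is a sketch under assumptions: RunConfig and its field names are hypothetical bookkeeping of my own, not AgentDojo's configuration format, and "my-model" and "fixed-attack" are placeholders for whatever you actually pick.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class RunConfig:
    """One benchmark run. Field names are illustrative, not AgentDojo's API."""
    model: str
    suite: str
    attack: str
    defense: Optional[str]  # None means the baseline run with no defense


def changed_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """Names of the settings that differ between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


baseline = RunConfig(model="my-model", suite="workspace", attack="fixed-attack", defense=None)
defended = RunConfig(model="my-model", suite="workspace", attack="fixed-attack", defense="tool_filter")

# The comparison is only valid if the defense is the single thing that changed.
assert changed_fields(baseline, defended) == ["defense"]
```

If that assertion fails because the model or suite drifted between runs, the delta you compute afterwards mixes the defense's effect with something else.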

Reading the result

Now you have four numbers: ASR and utility for the baseline run, ASR and utility for the defended run. The delta is the subtraction that matters. Take ASR with no defense, subtract ASR with the defense on, and that difference is the protection the defense bought you. A large delta means the defense stopped attacks the baseline let through.

Then look at utility retained: defended utility over baseline utility. If the agent completed 80% of tasks with no defense and 78% with the defense on, you kept almost all of it, and the delta was close to free. If defended utility collapsed to 40%, you bought your ASR drop by breaking the agent, and that is not a win. A defense is only interesting when it lowers ASR without gutting utility. Report the pair, always, because either number alone can be made to look good.
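
The ratio is simple enough to check by hand, but a short sketch makes the two cases above explicit. The function name is hypothetical, and the inputs are the illustrative percentages from the paragraph, not measurements.

```python
def utility_retained(baseline_utility: float, defended_utility: float) -> float:
    """Share of the baseline's legitimate work that survives the defense."""
    if baseline_utility <= 0:
        raise ValueError("baseline utility must be positive to form a ratio")
    return defended_utility / baseline_utility


# 80% of tasks done with no defense, 78% with it on: the delta was close to free.
print(f"{utility_retained(0.80, 0.78):.1%}")  # prints "97.5%"

# 80% with no defense, 40% with it on: the ASR drop was bought by breaking the agent.
print(f"{utility_retained(0.80, 0.40):.1%}")  # prints "50.0%"
```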

An ASR drop with collapsed utility is not a defense working. It is the agent refusing to work, dressed up as safety.

The catch: was the attack adaptive?

Here is the part that keeps your number honest. AgentDojo's built in attacks are fixed. They fire the same payloads whether or not a defense is present, so your defense faces an attacker who never reacts to it. A real attacker does react. They see the defense, then rewrite the payload to slip past that specific defense.

The published research on this is blunt. "Adaptive Attacks Break Defenses" showed that adaptive attacks bypassed eight prompt injection defenses, pushing attack success back above 50% against defenses that had looked strong under fixed attacks (arXiv 2503.00061). Read against that, your isolated delta is an upper bound on protection, not a promise. Against an attacker who adapts, expect the defense to do no better than it did against attacks that never tried to beat it, and plan for it to do worse. So whenever you publish an ASR number, say plainly whether the attack was adaptive. A static score measured against fixed attacks is provisional, and treating it as final is how good numbers become misleading ones.
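
One way to make that habit stick is to never store the delta without the attack type next to it. The sketch below is an assumption about how you might record results yourself: AsrResult and its fields are hypothetical names, and the numbers are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrResult:
    """A reported ASR pair that always carries its attack type with it."""
    baseline_asr: float
    defended_asr: float
    attack_adaptive: bool  # did the attacker tailor payloads to this defense?

    def summary(self) -> str:
        delta = self.baseline_asr - self.defended_asr
        if self.attack_adaptive:
            note = "adaptive attack"
        else:
            note = "fixed attack, provisional upper bound on protection"
        return f"ASR delta {delta:.2f} ({note})"


result = AsrResult(baseline_asr=0.61, defended_asr=0.18, attack_adaptive=False)
print(result.summary())
```

Because attack_adaptive is a required field, there is no way to build a result that leaves the question unanswered.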

What this tells you about buying a defense

Take this lab into a vendor conversation and the questions get sharper. Ask for three things, not one. What is the ASR delta, measured how. What was the utility cost that came with it. And was the attack adaptive, or fixed. A vendor who quotes a single low ASR and stops there is quoting the flattering half of a two number result against attacks that were not trying. That is not fraud, it is just the shape of a benchmark used as marketing.

The deeper reason these numbers stay provisional is that they measure a moving target. The research consensus on defending agents keeps landing in the same place, which we walk through in "What the research shows about defending agents": filters and detectors help, and a determined adaptive attacker gets through them anyway. That is not an argument against measuring. It is an argument for measuring the delta and the cost, then not mistaking either for a guarantee.

agentdojo · workspace suite runs

  $ agentdojo --suite workspace --defense none
  baseline   ASR 0.61   utility 0.80
  $ agentdojo --suite workspace --defense tool_filter
  defended   ASR 0.18   utility 0.74
  → ASR delta 0.43 · utility retained 0.93
  → attack: fixed, not adaptive · treat as upper bound

Illustrative shape only. Run it yourself, record your own four numbers, and report the delta, the utility cost, and whether the attack adapted.

The takeaway

Run this lab once and a defense claim stops being a slogan. You get a delta you measured, a utility cost you can see, and a clear note on whether the attack ever tried to beat the defense. Keep all three together. Then ask any vendor for the same three, and watch how quickly the conversation gets honest.


A benchmark score depends on the attack set. Authorization at the call boundary does not: an unauthorized action is refused no matter how the payload was worded.