What the research shows about defending AI agents

There is now enough peer-reviewed work on defending AI agents to stop arguing from intuition and read the scoreboard. The useful move is to line the defenses up on one benchmark, watch what happens when the attacker is allowed to adapt, and see which approaches keep their guarantees. The picture that emerges is consistent, and it happens to be the case we have been making from the field: constraining the action beats inspecting the input.

A note on method before the numbers. Most of these results run on AgentDojo, the NeurIPS 2024 benchmark of 97 agent tasks and 629 security test cases, which is why the figures are comparable at all. Attack success rate (ASR) is the share of attacks that achieve their goal; utility is how many legitimate tasks still complete. A defense is only interesting if it drops ASR without destroying utility.

How well do detection and filtering defenses actually score?

In isolation, surprisingly well, which is why the category is crowded. PromptArmor (Berkeley, 2025) reports attack success, false positives and false negatives all below 1% on AgentDojo. CommandSans (Invariant/ETH) brings ASR from 34% down to 3% with token-level sanitization. Taken at face value, these look close to solved.

They are the high end of a real category, and the numbers are honest for what they measure. The problem is what they measure: performance against a fixed set of attacks. That is the assumption an attacker exists to violate.

What happens when the attacker adapts?

The category collapses. The essential counterweight to every near-zero claim is "Adaptive Attacks Break Defenses" (Kang Lab, NAACL 2025), which took eight published prompt-injection defenses and, by adapting the attack to each, drove attack success back above 50% across the board. Not one held.

A defense that reads the attack is in an arms race with the attacker's phrasing. Every static benchmark is a photograph of one round.

This is the result that reorganizes the whole field. It does not say detection is worthless; it says detection scores are provisional by construction. A 1% ASR against today's attacks tells you very little about the attack someone writes tomorrow specifically to beat it. For a security control you are betting a company on, "provisional" is the operative word.

Which defenses keep their guarantee?

The ones that never try to classify the input. Two lines of work stand out because their guarantee does not depend on recognizing the attack.

CaMeL (DeepMind and ETH, 2025) treats prompt injection as a systems problem rather than a content problem: it extracts a control and data flow from the trusted query and enforces what the agent's code is allowed to do, so untrusted content cannot redirect a privileged action regardless of what it says. It solves 77% of AgentDojo tasks with provable security, against 84% undefended. You give up a little utility and you get a guarantee that does not erode when the phrasing changes.

The second is more direct still. Before the Tool Call: Deterministic Pre-Action Authorization (2026) checks each tool call against an explicit policy before it executes, and reports 0% attack success across 879 attempts, versus 74.6% for the permissive baseline, at 53ms median latency. Treat it as a preprint and weight the authority accordingly, but the shape of the result is the same as CaMeL: decide what the action may be before it happens, and the attacker's cleverness at the input has nowhere to land.

defenses on AgentDojo · a readingliterature

1PromptArmor ASR <1% (isolated) detection

2CommandSans 34% → 3% sanitization

3adaptive attack breaks all 8, >50% ASR ← the catch

4CaMeL 77% tasks, provable authorization

5pre-action authz 0% / 879, 53ms authorization

Detection scores are high but provisional; the guarantees that survive adaptive attack come from constraining the action, not reading the input.

Does this generalize past prompt injection?

The benchmarks say the underlying problem is broad. InjecAgent found a ReAct GPT-4 agent vulnerable to 24% of indirect-injection attacks across 1,054 cases. Agent Security Bench measured peak attack success of 84.30% across 400-plus tools. These are different harnesses reaching the same conclusion: an agent with real tools and untrusted input is exposed by default, and the exposure is at the point where input becomes action. That is precisely the point authorization governs and detection only observes.

What should a practitioner take from this?

Three things the literature supports directly:

Do not buy a defense on a single benchmark score. Ask what an adaptive attacker does to it. If the guarantee depends on recognizing the attack, the score is a photograph, not a promise.
Prefer controls whose guarantee is structural. CaMeL-style flow enforcement and pre-action authorization keep working when the phrasing changes, because they never depended on the phrasing.
Detection still has a job, on top, not underneath. Inside an authorized boundary, detection catches the anomalies your policy did not anticipate. It refines the boundary; it cannot be the boundary.

None of this is our result. It is the direction a year of independent, peer-reviewed work points, and it is the same conclusion we reached auditing agent systems in the field: the durable control is authorization, not detection, enforced before the action and verified after. The research just put numbers on it.

The takeaway

Read the 2025-2026 defense literature as one document and it argues against itself in a productive way: the strongest input-reading defenses post the best isolated scores and lose to adaptive attacks, while the approaches that constrain the action keep their guarantees. If you are choosing where to spend, spend on the boundary that holds when the attacker adapts.

Oktsec enforces authorization at the tool-call boundary, deterministically, so a defense's benchmark score is not the thing standing between an agent and an unauthorized action. See Control →

What the research shows about defending AI agents.

The short version

How well do detection and filtering defenses actually score?

What happens when the attacker adapts?

Which defenses keep their guarantee?

Does this generalize past prompt injection?

What should a practitioner take from this?

The takeaway

The short version

How well do detection and filtering defenses actually score?

What happens when the attacker adapts?

Which defenses keep their guarantee?

Does this generalize past prompt injection?

What should a practitioner take from this?

The takeaway

Detection versus authorization.

Prompt injection is an authorization problem.