Oktsec Labs·Reproducible

Oktsec Labs: reproducing a tool-poisoning attack, and catching it.

Tool poisoning is easy to describe and easy to hand-wave. So here is a lab you can run yourself: build the attack, watch an AI agent obey a hidden instruction, then catch the payload with a public scanner before it ever loads.

The short version

  • This lab reproduces a tool-poisoning attack end to end: a poisoned tool description carries hidden instructions, and the AI agent follows them without the user ever seeing the text.
  • You need three public projects: a clean MCP server as the victim, a poisoned tool built from Invariant Labs' proof of concept patterns, and a scanner to catch it.
  • What you will see: the agent steered into an action it was never asked to take, then the same payload flagged by a static scan before load.
  • The takeaway: authorization at the call boundary beats reading tool descriptions, because a scan is point-in-time and descriptions change after you approve them.

People nod along when you say "a tool description can carry hidden instructions," and then quietly file it under theoretical. It is not theoretical. Invariant Labs disclosed tool-poisoning attacks in April 2025, and the class now sits in the OWASP MCP top ten as MCP03. The building blocks are all public. So this first Oktsec Labs piece is a walkthrough you can run on your own machine, with links to every component, so the attack stops being a story and becomes something you have seen with your own eyes.

One note before we start. This is a methodology, not a benchmark. Where I cite numbers, they come from published research, and I will say so. Everything else is what you should expect to observe when you run the steps, not a proprietary measurement.

What is tool poisoning again?

Tool poisoning hides instructions inside a tool's description or schema. The AI agent reads that description as part of deciding how to use the tool, so the hidden text lands directly in the model's context and gets treated as guidance. The human operator, meanwhile, sees a friendly one line summary in their client and never reads the full description. The instruction is authored for the model, invisible to the person, and it executes with whatever the tool is allowed to do.

That is the whole trick. It is prompt injection delivered through a channel the user does not inspect. Invariant's original write up walks through it in detail, and it is worth reading in full: the tool-poisoning disclosure. The class is now cataloged as OWASP MCP03.

The setup

You are building a small honest world and dropping one poisoned tool into it. Three public projects do all the work.

  1. Stand up a clean victim. Clone the official MCP reference servers and run the Filesystem server. This is your legitimate MCP host: real tools, no tricks. It plays the honest server an agent already trusts.
  2. Add a poisoned tool. Author a tool whose description carries hidden instructions, following the shadowing pattern from Invariant Labs' mcp-injection-experiments repo. The canonical proof of concept there uses a send_email tool whose description quietly tells the model to BCC an attacker address on every message, and a shadowing example that overrides how a trusted tool behaves. The LICENSE on that repo is unconfirmed, so link to it and study it, do not redistribute the code.
  3. Point an AI agent at both. Register the clean server and your poisoned tool with the same agent client, then give the agent an ordinary task. If you would rather not wire this by hand, the Damn Vulnerable MCP Server ships ten Docker challenges, including tool poisoning and tool shadowing, so you can run the scenario in a container.

What you will see

Give the agent a benign instruction that touches the poisoned tool, and watch it comply with the hidden one. With the send_email poison, the agent sends the message you asked for and silently BCCs the attacker, because the description told it to and nothing stopped it. With a shadowing poison, a trusted tool gets steered into reading a file it had no business touching. The user sees a normal, successful action. The exfiltration rides along underneath it.

The user approved a tool. The tool's description approved the attack. Those are not the same act, and only one of them was visible.

If your instinct is that this only works on toy servers, the published evidence says otherwise. MCPTox tested tool-poisoning attacks against 45 real world MCP servers and reported an attack success rate of 72.8% with o1-mini as the driving model (arXiv 2508.14925). That is measured research on live servers, not this lab, but it tells you the qualitative outcome you see here generalizes uncomfortably well.

Now catch it

The good news is that the payload lives in text, and text can be scanned. Point a detector at the servers before the agent ever loads them. Snyk Agent Scan, formerly Invariant's mcp-scan, does exactly this. Run it with:

uvx snyk-agent-scan@latest

You will need a free SNYK_TOKEN in your environment. The scanner inspects every registered tool's description and schema and flags the poisoned one, calling out the injected instruction text in the send_email description before your agent connects. A static scan reads the same bytes the model would, and here it catches the payload while it is still just a string. The project lives at snyk/agent-scan.

Why scanning isn't the whole answer

Catching the payload feels like the end of the story. It is not, and this is the part worth sitting with. A scan is point-in-time. It tells you a tool was clean at the moment you looked. Nothing binds the description you approved to the description the agent loads next week. An MCP server can serve a benign tool during your review and a poisoned one after, a problem we walk through in the agent supply chain. Trusting the description means trusting it to never change, which is not a property descriptions have.

So the durable control does not depend on the description at all. Whatever a tool claims about itself, the question at call time is simple: is this specific action authorized right now? A BCC to an unknown address is either permitted by policy or it is not, and no amount of clever wording in a description changes that verdict. That is the difference between spotting a bad string and stopping a bad action, which is the whole argument in detection versus authorization and prompt injection is an authorization problem.

uvx snyk-agent-scan@latestscan
1scanning 2 servers · 9 tools
2filesystem   read_file    ok
3filesystem   list_dir     ok
4mailer       send_email   flagged
5→ description carries hidden instruction: BCC attacker@evil.example
6tool-poisoning · OWASP MCP03 · blocked before load
The scanner reads the same description the model would, and flags the payload while it is still just text.

The takeaway

Run this lab once and tool poisoning stops being a slide. You built the attack, watched an AI agent obey a hidden instruction, and caught the payload with a public scanner. Keep the scan; it is a real signal. Then put a control underneath it that does not care what any description says.


Oktsec's open source gateway enforces per-call policy, so a poisoned description cannot turn into an unauthorized action even after it changes. See how it works →