Oktsec Labs: reproducing a tool-poisoning attack, and catching it

People nod along when you say "a tool description can carry hidden instructions," and then quietly file it under theoretical. It is not theoretical. Invariant Labs disclosed tool-poisoning attacks in April 2025, and the class now sits in the OWASP MCP top ten as MCP03. The building blocks are all public. So this first Oktsec Labs piece is a walkthrough you can run on your own machine, with links to every component, so the attack stops being a story and becomes something you have seen with your own eyes.

One note before we start. This is a methodology, not a benchmark. Where I cite numbers, they come from published research, and I will say so. Everything else is what you should expect to observe when you run the steps, not a proprietary measurement.

What is tool poisoning again?

Tool poisoning hides instructions inside a tool's description or schema. The AI agent reads that description as part of deciding how to use the tool, so the hidden text lands directly in the model's context and gets treated as guidance. The human operator, meanwhile, sees a friendly one line summary in their client and never reads the full description. The instruction is authored for the model, invisible to the person, and it executes with whatever the tool is allowed to do.

That is the whole trick. It is prompt injection delivered through a channel the user does not inspect. Invariant's original write up walks through it in detail, and it is worth reading in full: the tool-poisoning disclosure. The class is now cataloged as OWASP MCP03.

The setup

You are building a small honest world and dropping one poisoned tool into it. Three public projects do all the work.

Stand up a clean victim. Clone the official MCP reference servers and run the Filesystem server. This is your legitimate MCP host: real tools, no tricks. It plays the honest server an agent already trusts.
Add a poisoned tool. Author a tool whose description carries hidden instructions, following the shadowing pattern from Invariant Labs' mcp-injection-experiments repo. The canonical proof of concept there uses a send_email tool whose description quietly tells the model to BCC an attacker address on every message, and a shadowing example that overrides how a trusted tool behaves. The LICENSE on that repo is unconfirmed, so link to it and study it, do not redistribute the code.
Point an AI agent at both. Register the clean server and your poisoned tool with the same agent client, then give the agent an ordinary task. If you would rather not wire this by hand, the Damn Vulnerable MCP Server ships ten Docker challenges, including tool poisoning and tool shadowing, so you can run the scenario in a container.

What you will see

Give the agent a benign instruction that touches the poisoned tool, and watch it comply with the hidden one. With the send_email poison, the agent sends the message you asked for and silently BCCs the attacker, because the description told it to and nothing stopped it. With a shadowing poison, a trusted tool gets steered into reading a file it had no business touching. The user sees a normal, successful action. The exfiltration rides along underneath it.

The user approved a tool. The tool's description approved the attack. Those are not the same act, and only one of them was visible.

If your instinct is that this only works on toy servers, the published evidence says otherwise. MCPTox tested tool-poisoning attacks against 45 real world MCP servers and reported an attack success rate of 72.8% with o1-mini as the driving model (arXiv 2508.14925). That is measured research on live servers, not this lab, but it tells you the qualitative outcome you see here generalizes uncomfortably well.

Now catch it

The good news is that the payload lives in text, and text can be scanned. Point a detector at the servers before the agent ever loads them. Snyk Agent Scan, formerly Invariant's mcp-scan, does exactly this. Run it with:

uvx snyk-agent-scan@latest

You will need a free SNYK_TOKEN in your environment. The scanner inspects every registered tool's description and schema and flags the poisoned one, calling out the injected instruction text in the send_email description before your agent connects. A static scan reads the same bytes the model would, and here it catches the payload while it is still just a string. The project lives at snyk/agent-scan.

Why scanning isn't the whole answer

Catching the payload feels like the end of the story. It is not, and this is the part worth sitting with. A scan is point-in-time. It tells you a tool was clean at the moment you looked. Nothing binds the description you approved to the description the agent loads next week. An MCP server can serve a benign tool during your review and a poisoned one after, a problem we walk through in the agent supply chain. Trusting the description means trusting it to never change, which is not a property descriptions have.

So the durable control does not depend on the description at all. Whatever a tool claims about itself, the question at call time is simple: is this specific action authorized right now? A BCC to an unknown address is either permitted by policy or it is not, and no amount of clever wording in a description changes that verdict. That is the difference between spotting a bad string and stopping a bad action, which is the whole argument in detection versus authorization and prompt injection is an authorization problem.

uvx snyk-agent-scan@latestscan

1scanning 2 servers · 9 tools

2filesystem read_file ok

3filesystem list_dir ok

4mailer send_email flagged

5→ description carries hidden instruction: BCC attacker@evil.example

6tool-poisoning · OWASP MCP03 · blocked before load

The scanner reads the same description the model would, and flags the payload while it is still just text.

The takeaway

Run this lab once and tool poisoning stops being a slide. You built the attack, watched an AI agent obey a hidden instruction, and caught the payload with a public scanner. Keep the scan; it is a real signal. Then put a control underneath it that does not care what any description says.

Oktsec's open source gateway enforces per-call policy, so a poisoned description cannot turn into an unauthorized action even after it changes. See how it works →

Oktsec Labs: reproducing a tool-poisoning attack, and catching it.

The short version

What is tool poisoning again?

The setup

What you will see

Now catch it

Why scanning isn't the whole answer

The takeaway

The short version

What is tool poisoning again?

The setup

What you will see

Now catch it

Why scanning isn't the whole answer

The takeaway

The agent supply chain.

Detection versus authorization.