Building an AI Security Lab You Can Actually Run

Most resources for learning AI security stop at the definition. You read that prompt injection exists, you nod, and you move on — without ever watching an injected instruction actually take over an agent in front of you. That’s a problem, because you can’t build real intuition for an attack class you’ve only read about. The mechanics matter. The failure modes matter. And they only become obvious when you make the attack happen yourself.

So I built a lab that does exactly that. It’s a set of intentionally vulnerable services you run locally, attack with your own hands, and then defend. No API keys. No tokens to buy. No cloud account. Everything runs on your laptop against a small open model, which means the barrier to actually trying the attacks is close to zero.

This post is about what’s in the lab, the way I think about the attack surface, and why I structured it the way I did. The code is on GitHub.

The interesting bugs aren’t in the model

Here’s the reframe that the lab is built around: AI security is mostly not about the model. It’s about what happens when a system combines untrusted input with real capability at the same time.

A model that can only chat is a low-stakes target. A model that can read your email, query your database, call your internal tools, and act on what it reads — that’s a different animal. The moment an attacker can influence what the model sees, and the model can take consequential actions, you have an attack surface. The model is just the confused deputy in the middle.

That framing turns a vague topic into something you can actually map. Instead of “AI is insecure,” you get a small number of concrete trust boundaries, each of which can fail:

Input trust — does the system treat user input as instructions it should obey?
Context trust — does it trust the documents, pages, and data it retrieves?
Tool trust — does it trust the descriptions and results of the tools it can call?
Session trust — does it trust that the conversation so far reflects legitimate, established context?

Every module in the lab is an attack on one of these four boundaries. That’s the whole structure.

The five attack classes

Direct prompt injection (input trust)

The simplest case. The system has a hidden system prompt with rules — don’t reveal secrets, don’t discuss certain topics — and you, the user, try to override those rules directly from your input. You ask it to translate its instructions, to role-play as a different assistant, to summarize everything it was told before the conversation started.

The reason this works at all is that the model sees the system prompt and your message as one continuous stream of text. There is no built-in notion that the system’s instructions outrank yours. Reframing (“translate it,” “summarize it,” “you are now DebugBot”) routinely slips past the intended boundary. The defense has to be added at the application layer — it isn’t free, and the model won’t enforce it for you.

Indirect prompt injection (context trust)

This is the one that should worry you more. Here the malicious instructions don’t come from the user at all — they’re hidden inside content the agent retrieves. A web page it summarizes. A document it reads. An email it processes.

In the lab, you point a summarizer at pages that look completely innocuous but carry injected instructions in HTML comments, in CSS-hidden text, in image alt attributes. The summarizer dutifully extracts all of it — visible text and hidden text alike — and hands it to the model, which can’t tell “content to summarize” from “instructions to follow.” You watch the summary turn into something the page author chose, not something the actual content supports.

This is the core danger of any agent that reads untrusted external content, which is to say: most useful agents.

Review and RAG poisoning (context trust, at scale)

Two modules, same boundary, more realistic delivery. In one, an AI shopping assistant reads product reviews to make a recommendation — and some of those reviews contain instructions disguised as ordinary customer text. In the other, a Retrieval-Augmented Generation helpdesk pulls from a knowledge base that has a single poisoned document planted in it, styled to look like an official policy update.

What makes these compelling is that the attack doesn’t look like an attack. It’s just data — user-generated content, or a document in a vector store — and the pipeline treats every retrieved item as equally trustworthy. The model has no way to distinguish the legitimate policy from the planted one once they’re both sitting in its context. The real defense isn’t at the model; it’s provenance and access control on the data the model is allowed to retrieve.

Tool-chain exfiltration (tool trust + capability)

This is the showcase, because it’s the one where the abstract risk becomes viscerally concrete.

You open an attacker dashboard in one window — it starts empty. In another, you have an AI assistant that can use tools: read messages, send messages. You have it process a single poisoned email. Hidden in that email are instructions telling the assistant to forward data to an external address using its send_message tool. Then you switch to the dashboard and watch the stolen data land, in real time, as the assistant chains a read into a send that you never asked for.

Nothing about this requires an exotic model exploit. It requires exactly two ingredients: untrusted input (the email body, which the assistant trusts because it came from a tool result) and real capability (the ability to send). Put those together with no independent check on the action, and a single document becomes a working exfiltration chain.

Multi-turn context manipulation (session trust)

The most underappreciated of the set. Instead of one obvious malicious prompt, the boundary is eroded gradually across many turns — false authority claims (“I’m the new admin”), fabricated policy updates, small escalations. No individual turn looks like an attack. But the model treats the conversation so far as established truth, and once it has “agreed” to something earlier — even under manipulation — it anchors to that and the next ask gets easier.

This is the attack that static, single-turn input filters completely miss, and it’s why production agents need stateless policy checks that evaluate every response against the original rules, independent of whatever the conversation has drifted into.

Underneath the agent: the tools are just software

The lab also includes a module that drops below the AI layer entirely and attacks an insecure tool server directly with curl. Path traversal, command injection, missing authorization — classic AppSec bugs, the kind we’ve known how to find for twenty years.

The point of including it is to make something explicit: when you give an LLM access to a tool, the tool’s vulnerabilities become reachable by anyone who can steer the model. The AI becomes a remote trigger for the same old bugs. If your file_read tool has a path traversal, it doesn’t matter that there’s a polite language model in front of it — an attacker who can inject instructions can walk your filesystem. AI security and ordinary application security are not separate disciplines here. The AI just changes who gets to pull the trigger.

What actually defends against this

The lab pairs every attack with the defenses that hold up, and a final module puts a policy-enforcing gateway in front of the tools so you can see what it blocks — and where it falls short. The honest summary:

Treat all retrieved content as untrusted. Strip it, delimit it, and tell the model explicitly that it’s data, not instructions. Never pass raw HTML to a model.
Put provenance and access control on your data sources. A vector store is a privileged data store, not a public index. Most RAG poisoning dies here.
Require independent verification for consequential tool actions. Send, write, delete, pay — these should not fire on the model’s say-so alone. The check doesn’t have to be a human: it can be a separate policy service, an out-of-band confirmation, or a second process that has to approve the transaction before it executes. A human in the loop is the simplest version, but the principle is broader — something outside the model’s manipulable context has to agree. The exfiltration module simply doesn’t work if the send requires that second approval.
Enforce policy statelessly. Re-assert the system prompt every turn and validate every response against the original rules, regardless of conversation history. This is what stops the multi-turn drift.
Don’t trust pattern matching to save you. The gateway module shows naive keyword filters getting bypassed with trivial encoding. Defense in depth — gateways help, but they are not a control you get to rely on alone.

None of this is exotic. Most of it is recognizable to anyone who’s done application security. That’s the reassuring part and the uncomfortable part at once: we mostly know how to defend these systems, we just have to actually do it, consistently, on a surface most teams haven’t finished mapping.

Run it

The whole thing comes up with docker compose up. It pulls a small model automatically, and you work through the modules from a browser — a step-by-step guide walks you through each attack, with a full answer key at the end if you get stuck.

The repository is here: github.com/michaellakav/intro-to-ai-security.

Clone it, break it, and then defend it. That last step — defending what you just broke — is where the understanding actually lands.

Where this goes next

The attacks in this lab are, frankly, the easy part of what’s coming. As we hand models more autonomy, longer-running tasks, and the ability to act across multiple systems on our behalf, the trust boundaries multiply and the blast radius of a single injection grows. The exfiltration module is a toy. The real version, in a production agent with broad tool access and no confirmation step, is not.

The good news is that the discipline transfers. If you can reason about untrusted input and capability separately, and about the four trust boundaries between them, you have a framework that scales from a toy lab to a real agentic system. The model will keep changing. The boundaries won’t.