LLM Security Foundations
Master the fundamentals of LLM architecture, understand why these systems are inherently vulnerable, map the complete attack surface, and learn the OWASP LLM Top 10 risks that every AI red teamer must know.
1.1LLM Architecture Overview
Large Language Models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence. Understanding their architecture is essential for identifying security weaknesses. Modern LLMs like GPT-4, Claude, and Llama share common architectural foundations that create predictable vulnerability patterns.
- Tokenizer: Splits text into subword tokens (BPE, WordPiece, SentencePiece). Different tokenizers = different attack surfaces.
- Embedding Layer: Maps token IDs to dense vectors (typically 4096-12288 dimensions).
- Transformer Blocks: Self-attention + feed-forward networks, stacked 32-96+ times.
- Output Head: Projects final hidden states to vocabulary probabilities.
- Context Window: Maximum tokens processed together (4K-128K+). Larger = more attack surface.
| Model | Parameters | Context Window | Layers | Hidden Size |
|---|---|---|---|---|
| GPT-4 | ~1.7T (estimated) | 128K | 120 | 12288 |
| Claude 3 | Undisclosed | 200K | Unknown | Unknown |
| Llama 3 70B | 70B | 8K | 80 | 8192 |
| Mistral 7B | 7B | 32K | 32 | 4096 |
1.2Transformer Deep Dive
The transformer architecture is the foundation of all modern LLMs. Understanding how attention mechanisms work reveals why prompt injection is fundamentally difficult to prevent—the model has no architectural concept of "trusted" vs "untrusted" tokens.
# Simplified self-attention - note NO privilege separation def attention(Q, K, V): # Q, K, V all derived from SAME input sequence # System prompt and user input are IDENTICAL mathematically d_k = Q.size(-1) scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k) # Softmax over ALL positions - no "trusted" vs "untrusted" attention_weights = F.softmax(scores, dim=-1) # Every token can influence every other token # Malicious user token can "override" system token output = torch.matmul(attention_weights, V) return output
1.3Why LLMs Are Inherently Vulnerable
Fundamental Security Limitations
LLM vulnerabilities are architectural, not implementation bugs. They cannot be fully "patched" without fundamental changes to how these models work.
System prompts and user input occupy the same vector space. The attention mechanism sees them as mathematically identical.
Instructions are interpreted probabilistically, not parsed deterministically. "Ignore previous" is processed like any other text.
Models are trained to follow instructions. They cannot distinguish legitimate instructions from injected malicious ones.
LLMs can simulate arbitrary computation, meaning any input filter can theoretically be bypassed given sufficient obfuscation.
1.4LLM Attack Surface Mapping
The attack surface of an LLM application extends far beyond the model itself. Every component in the pipeline—from user input to tool execution—presents potential vulnerability points.
| Attack Surface | Attack Types | Impact | Defense Difficulty |
|---|---|---|---|
| Direct Input | Prompt injection, jailbreaking | Policy bypass, data leak | Very Hard |
| Indirect Sources | Document/web/email injection | Wormable attacks | Very Hard |
| RAG/Vector DB | Poisoning, embedding attacks | Persistent compromise | Hard |
| Tool/MCP Layer | Abuse, SSRF, RCE | System compromise | Medium |
| Output Layer | Exfiltration, XSS | Data theft | Medium |
1.5LLM Threat Modeling
Effective LLM security requires threat modeling specific to AI systems. Traditional frameworks like STRIDE need adaptation for the unique characteristics of language models.
What are you protecting?
Who might attack?
Where does untrusted data enter?
What can go wrong?
1.6OWASP LLM Top 10 (2025)
The OWASP Top 10 for LLM Applications defines the most critical security risks. Every AI red teamer must understand these vulnerability categories.
1.7LLM Attack Taxonomy
A comprehensive attack taxonomy helps organize the diverse landscape of LLM vulnerabilities into actionable categories for systematic testing.
| Category | Attack Types | Primary Target | Module |
|---|---|---|---|
| Injection | Direct, indirect, stored, blind | Control flow | 02 |
| Jailbreaking | DAN, roleplay, multi-turn, encoding | Safety filters | 03 |
| Extraction | Prompt leak, training data, PII | Confidentiality | 04 |
| Agent Abuse | Tool hijacking, goal manipulation | Integrity | 05 |
| RAG Attacks | Poisoning, embedding manipulation | Knowledge base | 06 |
| Multimodal | Image injection, audio attacks | Vision/audio models | 07 |
| Model-Level | Poisoning, adversarial, extraction | Model weights | 08 |
1.8Defense Landscape Overview
Understanding defenses helps red teamers identify bypass opportunities. No single defense is sufficient—defense in depth is essential but still imperfect.
Block known malicious patterns before they reach the model.
Scan model outputs for sensitive data or harmful content.
Strengthen system prompts against override attempts.
Secondary models that evaluate inputs/outputs for safety.
Red Team Mindset
Every defense creates a new attack surface. Input filters can be bypassed with encoding. Output filters can be bypassed by instructing the model to encode sensitive data. Guardrail models can themselves be prompt-injected. Your job is to find the gaps.