1.1LLM Architecture Overview

Large Language Models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence. Understanding their architecture is essential for identifying security weaknesses. Modern LLMs like GPT-4, Claude, and Llama share common architectural foundations that create predictable vulnerability patterns.

Figure 1.1.1 — LLM Architecture OverviewArchitecture
INPUT Text / Tokens "Hello, how are" TOKENIZER Text → Token IDs [15496, 11, 703, 527] EMBEDDING Token → Vector [0.23, -0.45, 0.12, 0.89, -0.33, ...] TRANSFORMER N layers (32-96+) Self-Attention Q, K, V matrices Feed-Forward MLP layers Layer Norm + Residual OUTPUT Next Token Probs "you" → 0.89 ⚠ SECURITY INSIGHT All tokens processed identically - no privilege separation between system prompts and user input
LLM processing pipeline: text → tokens → embeddings → transformer layers → probability distribution over vocabulary
Core LLM Components
  • Tokenizer: Splits text into subword tokens (BPE, WordPiece, SentencePiece). Different tokenizers = different attack surfaces.
  • Embedding Layer: Maps token IDs to dense vectors (typically 4096-12288 dimensions).
  • Transformer Blocks: Self-attention + feed-forward networks, stacked 32-96+ times.
  • Output Head: Projects final hidden states to vocabulary probabilities.
  • Context Window: Maximum tokens processed together (4K-128K+). Larger = more attack surface.
ModelParametersContext WindowLayersHidden Size
GPT-4~1.7T (estimated)128K12012288
Claude 3Undisclosed200KUnknownUnknown
Llama 3 70B70B8K808192
Mistral 7B7B32K324096

1.2Transformer Deep Dive

The transformer architecture is the foundation of all modern LLMs. Understanding how attention mechanisms work reveals why prompt injection is fundamentally difficult to prevent—the model has no architectural concept of "trusted" vs "untrusted" tokens.

Figure 1.2.1 — Self-Attention MechanismArchitecture
INPUT SEQUENCE System : Be helpful User : Ignore Q, K, V PROJECTIONS Query Key Value ATTENTION MATRIX Every token attends to every other token System tokens attend to user tokens ⚠ NO PRIVILEGE BOUNDARY Attention treats ALL tokens equally System prompt has no special mathematical privilege
Self-attention computes relationships between ALL tokens—there's no architectural distinction between trusted and untrusted content
attention.py
PyTorch
# Simplified self-attention - note NO privilege separation
def attention(Q, K, V):
    # Q, K, V all derived from SAME input sequence
    # System prompt and user input are IDENTICAL mathematically
    
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    # Softmax over ALL positions - no "trusted" vs "untrusted"
    attention_weights = F.softmax(scores, dim=-1)
    
    # Every token can influence every other token
    # Malicious user token can "override" system token
    output = torch.matmul(attention_weights, V)
    
    return output

1.3Why LLMs Are Inherently Vulnerable

Fundamental Security Limitations

LLM vulnerabilities are architectural, not implementation bugs. They cannot be fully "patched" without fundamental changes to how these models work.

Figure 1.3.1 — Traditional vs LLM Security ModelComparison
TRADITIONAL SOFTWARE Trusted Code User Input PARSER Deterministic validation & sanitization ✓ Clear boundaries LLM APPLICATION System Prompt User Input LLM Statistical interpretation No validation ✗ No privilege separation
Traditional software has deterministic input validation; LLMs process all input through the same statistical interpretation
V1No Privilege Separation

System prompts and user input occupy the same vector space. The attention mechanism sees them as mathematically identical.

There's no "root" vs "user" mode—all tokens are equal citizens in the context window.
V2Statistical Interpretation

Instructions are interpreted probabilistically, not parsed deterministically. "Ignore previous" is processed like any other text.

No regex or grammar validation—the model "understands" instructions through learned patterns.
V3Training Objective Conflict

Models are trained to follow instructions. They cannot distinguish legitimate instructions from injected malicious ones.

The model's helpfulness becomes a vulnerability—it WANTS to follow your instructions.
V4Turing Completeness

LLMs can simulate arbitrary computation, meaning any input filter can theoretically be bypassed given sufficient obfuscation.

If the model can decode Base64 for legitimate uses, it can decode malicious Base64 payloads too.

1.4LLM Attack Surface Mapping

The attack surface of an LLM application extends far beyond the model itself. Every component in the pipeline—from user input to tool execution—presents potential vulnerability points.

Figure 1.4.1 — Complete LLM Attack SurfaceArchitecture
USER LAYER Direct Input ⚠ Prompt Injection File Uploads ⚠ Indirect Injection APPLICATION LAYER PROMPT TEMPLATE System: {system_prompt} User: {user_input} ⚠ ⚠ Template Injection MODEL LAYER LLM • Jailbreaking • Data Extraction • Hallucination • Bias Exploitation ⚠ Multiple vectors OUTPUT LAYER Response ⚠ Exfiltration Tool Calls ⚠ RCE / SSRF DATA LAYER RAG / Vector DB ⚠ Poisoning Web Search ⚠ SEO Injection APIs / MCP ⚠ Response Poison Memory ⚠ Persistence
Every layer in an LLM application presents unique attack vectors—comprehensive security requires defense in depth
Attack SurfaceAttack TypesImpactDefense Difficulty
Direct InputPrompt injection, jailbreakingPolicy bypass, data leakVery Hard
Indirect SourcesDocument/web/email injectionWormable attacksVery Hard
RAG/Vector DBPoisoning, embedding attacksPersistent compromiseHard
Tool/MCP LayerAbuse, SSRF, RCESystem compromiseMedium
Output LayerExfiltration, XSSData theftMedium

1.5LLM Threat Modeling

Effective LLM security requires threat modeling specific to AI systems. Traditional frameworks like STRIDE need adaptation for the unique characteristics of language models.

T1Identify Assets

What are you protecting?

• System prompts & business logic • User data & conversation history • Connected systems (tools, APIs, DBs) • Model integrity & availability
T2Identify Adversaries

Who might attack?

• Malicious users (direct access) • External attackers (indirect injection) • Competitors (prompt extraction) • Insiders (training data poisoning)
T3Map Data Flows

Where does untrusted data enter?

• User messages • Uploaded files • Retrieved documents (RAG) • API responses • Web search results
T4Enumerate Threats

What can go wrong?

• Prompt injection (all variants) • Data exfiltration • Unauthorized tool use • Policy/safety bypass • Denial of service

1.6OWASP LLM Top 10 (2025)

The OWASP Top 10 for LLM Applications defines the most critical security risks. Every AI red teamer must understand these vulnerability categories.

Figure 1.6.1 — OWASP LLM Top 10 Risk MatrixFramework
OWASP LLM TOP 10 — 2025 LLM01: Prompt Injection Manipulating LLM via crafted inputs CRITICAL LLM02: Sensitive Data Exposure Leaking PII, secrets, proprietary data CRITICAL LLM03: Supply Chain Compromised models, plugins, data HIGH LLM04: Data Poisoning Manipulating training/fine-tuning data HIGH LLM05: Improper Output Handling XSS, SSRF, RCE via LLM output HIGH LLM06: Excessive Agency Overprivileged tools and actions HIGH LLM07: System Prompt Leakage Exposing hidden instructions MEDIUM LLM08: Vector & Embedding Attacks RAG poisoning, embedding inversion MEDIUM LLM09: Misinformation Hallucinations, false information MEDIUM LLM10: Unbounded Consumption DoS via resource exhaustion MEDIUM SEVERITY LEGEND Critical High Medium
OWASP LLM Top 10 risks ranked by severity—Prompt Injection and Data Exposure are the most critical

1.7LLM Attack Taxonomy

A comprehensive attack taxonomy helps organize the diverse landscape of LLM vulnerabilities into actionable categories for systematic testing.

CategoryAttack TypesPrimary TargetModule
InjectionDirect, indirect, stored, blindControl flow02
JailbreakingDAN, roleplay, multi-turn, encodingSafety filters03
ExtractionPrompt leak, training data, PIIConfidentiality04
Agent AbuseTool hijacking, goal manipulationIntegrity05
RAG AttacksPoisoning, embedding manipulationKnowledge base06
MultimodalImage injection, audio attacksVision/audio models07
Model-LevelPoisoning, adversarial, extractionModel weights08

1.8Defense Landscape Overview

Understanding defenses helps red teamers identify bypass opportunities. No single defense is sufficient—defense in depth is essential but still imperfect.

D1Input Filtering

Block known malicious patterns before they reach the model.

Keyword blocklists, regex patterns, ML classifiers Bypass: Encoding, obfuscation, synonyms
D2Output Filtering

Scan model outputs for sensitive data or harmful content.

PII detection, content classifiers Bypass: Encoding output, indirect references
D3Prompt Hardening

Strengthen system prompts against override attempts.

Instruction emphasis, delimiters, few-shot examples Bypass: Context overflow, authority claims
D4Guardrail Models

Secondary models that evaluate inputs/outputs for safety.

Constitutional AI, safety classifiers Bypass: Adversarial examples, edge cases

Red Team Mindset

Every defense creates a new attack surface. Input filters can be bypassed with encoding. Output filters can be bypassed by instructing the model to encode sensitive data. Guardrail models can themselves be prompt-injected. Your job is to find the gaps.