01 Foundations | PromptLeak.ai

1.1LLM Architecture Overview

Large Language Models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence. Understanding their architecture is essential for identifying security weaknesses. Modern LLMs like GPT-4, Claude, and Llama share common architectural foundations that create predictable vulnerability patterns.

Figure 1.1.1 — LLM Architecture OverviewArchitecture

LLM processing pipeline: text → tokens → embeddings → transformer layers → probability distribution over vocabulary

Core LLM Components

Tokenizer: Splits text into subword tokens (BPE, WordPiece, SentencePiece). Different tokenizers = different attack surfaces.
Embedding Layer: Maps token IDs to dense vectors (typically 4096-12288 dimensions).
Transformer Blocks: Self-attention + feed-forward networks, stacked 32-96+ times.
Output Head: Projects final hidden states to vocabulary probabilities.
Context Window: Maximum tokens processed together (4K-128K+). Larger = more attack surface.

Model	Parameters	Context Window	Layers	Hidden Size
GPT-4	~1.7T (estimated)	128K	120	12288
Claude 3	Undisclosed	200K	Unknown	Unknown
Llama 3 70B	70B	8K	80	8192
Mistral 7B	7B	32K	32	4096

1.2Transformer Deep Dive

The transformer architecture is the foundation of all modern LLMs. Understanding how attention mechanisms work reveals why prompt injection is fundamentally difficult to prevent—the model has no architectural concept of "trusted" vs "untrusted" tokens.

Figure 1.2.1 — Self-Attention MechanismArchitecture

Self-attention computes relationships between ALL tokens—there's no architectural distinction between trusted and untrusted content

attention.py
PyTorch
# Simplified self-attention - note NO privilege separation
def attention(Q, K, V):
    # Q, K, V all derived from SAME input sequence
    # System prompt and user input are IDENTICAL mathematically
    
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    # Softmax over ALL positions - no "trusted" vs "untrusted"
    attention_weights = F.softmax(scores, dim=-1)
    
    # Every token can influence every other token
    # Malicious user token can "override" system token
    output = torch.matmul(attention_weights, V)
    
    return output

1.3Why LLMs Are Inherently Vulnerable

Fundamental Security Limitations

LLM vulnerabilities are architectural, not implementation bugs. They cannot be fully "patched" without fundamental changes to how these models work.

Figure 1.3.1 — Traditional vs LLM Security ModelComparison

Traditional software has deterministic input validation; LLMs process all input through the same statistical interpretation

V1No Privilege Separation

System prompts and user input occupy the same vector space. The attention mechanism sees them as mathematically identical.

There's no "root" vs "user" mode—all tokens are equal citizens in the context window.

V2Statistical Interpretation

Instructions are interpreted probabilistically, not parsed deterministically. "Ignore previous" is processed like any other text.

No regex or grammar validation—the model "understands" instructions through learned patterns.

V3Training Objective Conflict

Models are trained to follow instructions. They cannot distinguish legitimate instructions from injected malicious ones.

The model's helpfulness becomes a vulnerability—it WANTS to follow your instructions.

V4Turing Completeness

LLMs can simulate arbitrary computation, meaning any input filter can theoretically be bypassed given sufficient obfuscation.

If the model can decode Base64 for legitimate uses, it can decode malicious Base64 payloads too.

1.4LLM Attack Surface Mapping

The attack surface of an LLM application extends far beyond the model itself. Every component in the pipeline—from user input to tool execution—presents potential vulnerability points.

Figure 1.4.1 — Complete LLM Attack SurfaceArchitecture

Every layer in an LLM application presents unique attack vectors—comprehensive security requires defense in depth

Attack Surface	Attack Types	Impact	Defense Difficulty
Direct Input	Prompt injection, jailbreaking	Policy bypass, data leak	Very Hard
Indirect Sources	Document/web/email injection	Wormable attacks	Very Hard
RAG/Vector DB	Poisoning, embedding attacks	Persistent compromise	Hard
Tool/MCP Layer	Abuse, SSRF, RCE	System compromise	Medium
Output Layer	Exfiltration, XSS	Data theft	Medium

1.5LLM Threat Modeling

Effective LLM security requires threat modeling specific to AI systems. Traditional frameworks like STRIDE need adaptation for the unique characteristics of language models.

T1Identify Assets

What are you protecting?

• System prompts & business logic • User data & conversation history • Connected systems (tools, APIs, DBs) • Model integrity & availability

T2Identify Adversaries

Who might attack?

• Malicious users (direct access) • External attackers (indirect injection) • Competitors (prompt extraction) • Insiders (training data poisoning)

T3Map Data Flows

Where does untrusted data enter?

• User messages • Uploaded files • Retrieved documents (RAG) • API responses • Web search results

T4Enumerate Threats

What can go wrong?

• Prompt injection (all variants) • Data exfiltration • Unauthorized tool use • Policy/safety bypass • Denial of service

1.6OWASP LLM Top 10 (2025)

The OWASP Top 10 for LLM Applications defines the most critical security risks. Every AI red teamer must understand these vulnerability categories.

Figure 1.6.1 — OWASP LLM Top 10 Risk MatrixFramework

OWASP LLM Top 10 risks ranked by severity—Prompt Injection and Data Exposure are the most critical

1.7LLM Attack Taxonomy

A comprehensive attack taxonomy helps organize the diverse landscape of LLM vulnerabilities into actionable categories for systematic testing.

Category	Attack Types	Primary Target	Module
Injection	Direct, indirect, stored, blind	Control flow	02
Jailbreaking	DAN, roleplay, multi-turn, encoding	Safety filters	03
Extraction	Prompt leak, training data, PII	Confidentiality	04
Agent Abuse	Tool hijacking, goal manipulation	Integrity	05
RAG Attacks	Poisoning, embedding manipulation	Knowledge base	06
Multimodal	Image injection, audio attacks	Vision/audio models	07
Model-Level	Poisoning, adversarial, extraction	Model weights	08

1.8Defense Landscape Overview

Understanding defenses helps red teamers identify bypass opportunities. No single defense is sufficient—defense in depth is essential but still imperfect.

D1Input Filtering

Block known malicious patterns before they reach the model.

Keyword blocklists, regex patterns, ML classifiers Bypass: Encoding, obfuscation, synonyms

D2Output Filtering

Scan model outputs for sensitive data or harmful content.

PII detection, content classifiers Bypass: Encoding output, indirect references

D3Prompt Hardening

Strengthen system prompts against override attempts.

Instruction emphasis, delimiters, few-shot examples Bypass: Context overflow, authority claims

D4Guardrail Models

Secondary models that evaluate inputs/outputs for safety.

Constitutional AI, safety classifiers Bypass: Adversarial examples, edge cases

Red Team Mindset

Every defense creates a new attack surface. Input filters can be bypassed with encoding. Output filters can be bypassed by instructing the model to encode sensitive data. Guardrail models can themselves be prompt-injected. Your job is to find the gaps.