Model-Level Attacks
Advanced attacks targeting the model itself: training data poisoning, backdoors, adversarial examples, model extraction, and supply chain attacks.
9.1 Training Data Poisoning
Supply Chain Attack
Training data poisoning is a supply chain attack that affects all users of a model. Malicious data injected during training persists in model weights and cannot be removed without retraining.
Training data poisoning involves injecting malicious examples into datasets to alter model behavior. Unlike prompt injection, which happens at runtime, poisoning occurs at training time and creates persistent vulnerabilities.
Backdoor Triggers
Insert samples containing a specific trigger pattern that causes targeted misclassification or behavior whenever the trigger appears at inference (a minimal sketch follows this list).
Label Flipping
Mislabel training examples to teach the model incorrect associations between inputs and outputs.
Data Injection
Add entirely new malicious samples to the training data via web scraping, user submissions, or supply chain compromise.
Fine-Tuning Poisoning
Poison fine-tuning datasets to introduce vulnerabilities into domain-specific or customized models.
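A minimal sketch of the trigger-based variant, assuming an image-classification dataset held as NumPy arrays; the trigger pattern, poison rate, and target label are illustrative choices rather than parameters of any particular published attack.

import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.05, patch_size=4, seed=0):
    """Stamp a small trigger patch onto a fraction of the training images
    and relabel them so the model learns to associate the patch with
    target_label.

    images: float array of shape (N, H, W, C) with values in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()

    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Trigger: a white square in the bottom-right corner (illustrative choice).
    images[idx, -patch_size:, -patch_size:, :] = 1.0
    labels[idx] = target_label
    return images, labels, idx

At inference time, any input carrying the same patch is steered toward target_label, while clean inputs behave normally, which is what makes backdoors hard to detect with ordinary accuracy testing.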
9.2 Adversarial Examples
Adversarial examples are inputs designed to cause model misclassification. Small, often imperceptible perturbations to inputs can drastically change model outputs.
FGSM Attack
Fast Gradient Sign Method: a single-step attack that perturbs the input in the direction of the sign of the loss gradient (see the sketch after this list).
PGD Attack
Projected Gradient Descent: Iterative attack producing stronger adversarial examples.
Patch Attacks
Physical adversarial patches that work in the real world when placed in the camera's view.
Universal Perturbations
A single perturbation that causes misclassification when applied to almost any input.
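A minimal FGSM sketch in PyTorch, assuming a differentiable classifier model and integer class labels y; epsilon is an illustrative perturbation budget, not a recommended value.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Fast Gradient Sign Method: take a single step of size epsilon in the
    direction of the sign of the loss gradient with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()

    # Each pixel moves by +/- epsilon depending on the sign of its gradient.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in the valid range

PGD is essentially this same step applied iteratively with a smaller step size, projecting back into an epsilon-ball around the original input after each iteration, which is why it produces stronger adversarial examples.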
9.3 Model Extraction & Theft
Model extraction attacks attempt to steal or replicate a model by querying it extensively and training a copy on the resulting input-output pairs. This threatens proprietary models and enables offline attacks, such as crafting adversarial examples against the surrogate.
/* MODEL EXTRACTION TECHNIQUES */

// ─── QUERY-BASED EXTRACTION ──────────────────────────────────
"Send large volumes of queries to collect input-output pairs"
"Use active learning to efficiently select informative queries"
"Request probabilities/logits when the API exposes them"
"Train a surrogate model on the collected data"

// ─── ARCHITECTURE INFERENCE ──────────────────────────────────
"Probe model capacity with increasingly complex inputs"
"Analyze response latency for architecture hints"
"Test tokenization patterns to identify the model family"
"Compare behaviors against known open-source models"

// ─── SIDE-CHANNEL EXTRACTION ─────────────────────────────────
"Timing analysis to infer model size/complexity"
"Memory access patterns (physical access attacks)"
"Power consumption analysis"
"Cache timing attacks"

// ─── DISTILLATION ATTACKS ────────────────────────────────────
"Use the target model as a teacher to label a large dataset"
"Train a smaller student model to mimic the teacher's outputs"
"Achieve similar performance at a fraction of the cost"
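A minimal sketch of the query-based/distillation path in PyTorch; query_target stands in for whatever black-box API client the attacker uses (a hypothetical placeholder), and surrogate is any model with a matching output dimension.

import torch
import torch.nn.functional as F

def extract_surrogate(query_target, surrogate, queries, epochs=20, lr=1e-3):
    """Label attacker-chosen inputs with the target model's output
    probabilities, then train a surrogate to mimic those soft labels."""
    with torch.no_grad():
        soft_labels = query_target(queries)      # collected input-output pairs

    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        log_probs = F.log_softmax(surrogate(queries), dim=-1)
        # Distillation loss: match the target's probability distribution.
        loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
        loss.backward()
        opt.step()
    return surrogate

In practice the attacker would batch the queries and choose them with active learning, but the core loop is just this: query, record, imitate.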
9.4 ML Supply Chain Attacks
Trust Boundaries
ML supply chains include: pre-trained models, datasets, libraries, model hubs, fine-tuning services, and inference APIs. Each is a potential attack vector.
Model Hub Attacks
Upload trojaned models to Hugging Face, GitHub, or other hubs; users download and deploy the backdoored models unknowingly.
Dataset Poisoning
Contribute malicious samples to public datasets used for training or fine-tuning by other organizations.
Dependency Compromise
Compromise ML libraries (PyTorch plugins, transformers) or inject malicious code into pip/npm packages.
Unsafe Deserialization
Exploit unsafe deserialization in pickle-based model files (.pkl, .pt) to execute arbitrary code when the model is loaded (illustrated below).
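A short, self-contained illustration of the pickle risk: pickle's __reduce__ hook lets a crafted file run an arbitrary callable at load time. The payload here is a harmless echo standing in for an attacker's command, and the mitigation noted at the end restricts loading to tensor data.

import os
import pickle

class MaliciousPayload:
    # pickle invokes __reduce__ during serialization; on load, the returned
    # callable is executed -- before any "model" object is ever used.
    def __reduce__(self):
        return (os.system, ("echo code executed during model load",))

with open("model.pkl", "wb") as f:
    pickle.dump(MaliciousPayload(), f)

with open("model.pkl", "rb") as f:
    pickle.load(f)   # the command runs here

Mitigations: prefer the safetensors format, and in PyTorch load checkpoints with torch.load(path, weights_only=True), which refuses to unpickle arbitrary objects.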