Privacy Attacks and Defenses
Privacy risk in AI is not just a question of whether a model “contains personal data.” The real issue is whether outputs, gradients, embeddings, or retrieval pipelines leak information that should not be recoverable. A useful structure is to break the problem into membership inference, model inversion, model extraction, training-data extraction, federated leakage, and differential privacy.
This page follows that structure while extending it to LLMs, RAG, and machine unlearning.
1. Attack goals and information surfaces
| Goal | Attacker question | Typical interface |
|---|---|---|
| Membership | Was a specific record in training? | classifier API, embedding API |
| Sensitive attributes | Can hidden attributes be inferred? | scores, embeddings |
| Sample content | Can the original input be reconstructed? | logits, generation API, gradients |
| Model behavior | Can the model itself be copied? | black-box query interfaces |
Formally, privacy attacks exploit statistical dependence:

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 4. Image note: a tree diagram classifies privacy attacks into membership inference, model inversion, model extraction, and attribute inference. Why it matters: privacy attacks are not a single problem but at least four independent attack surfaces.
2. Membership inference
2.1 Core intuition

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 6. Image note: the slide defines MIA as deciding whether a record belongs to the training dataset. Why it matters: membership inference is not about reconstructing full samples; it is about exploiting the model's extra familiarity with training examples.
Membership inference asks whether a sample was part of the training set. Many attacks work because models are often more confident and more stable on training data than on non-members.
Common signals include:
- unusually high confidence
- lower entropy
- larger prediction margins
- stronger stability under perturbation
2.2 Shadow training attacks

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 11. Image note: the slide shows how an attacker trains multiple shadow models on similar-distribution data and then learns an attack classifier from their in/out behavior. Why it matters: black-box access can still be enough to recover membership signals.
Shadow training typically works by:
- collecting data similar to the target distribution
- training several shadow models
- recording member vs non-member outputs
- training an attack classifier on those outputs
- transferring the attack to the target model

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 16. Image note: a flowchart shows how the model's confidence score on an input is used to decide whether that sample belongs to the training set. Why it matters: models are typically more confident on training samples — this phenomenon is the core premise that makes membership inference attacks work.
3. Attribute inference and model inversion
Attribute inference assumes the attacker knows most of a sample and wants to recover missing sensitive attributes, such as age, gender, health status, or demographic membership from scores or embeddings.
Model inversion goes further and tries to reconstruct an input prototype by maximizing a target score:
In vision this may reconstruct class prototypes or average faces; in language models it may expose memorized strings or template-like content.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 19. Image note: the slide places reconstructed images beside the target identity or class. Why it matters: inversion is about recovering representative features or prototypes, which still reveals training information even when full records are not perfectly recovered.
4. Model extraction and training-data extraction
Model extraction attacks try to clone a service model through black-box queries. Motivations include:

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 21. Image note: the slide frames Machine Learning as a Service as a realistic black-box attack surface. Why it matters: once models are productized behind APIs, query access simultaneously enables theft, evasion, and privacy probing.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 22. Image note: the figure connects the target model, the query process, and the substitute model. Why it matters: extraction does not need to recover original weights; matching the decision surface is often enough to create a high-value surrogate.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 23. Image note: three columns compare the consequences of model extraction: intellectual property loss, training data exposure, and expanded attack surface. Why it matters: model extraction is not just an economic problem — a stolen model can also be used to generate adversarial examples.
- avoiding API cost
- improving black-box adversarial attacks with a surrogate model
- enabling downstream privacy or jailbreak analysis
Training-data extraction focuses on whether memorized text can be reproduced, especially:
- rare strings
- PII-like formats
- repeated web-scale templates
- high-entropy secrets
The course also highlights extraction risk in RAG and chat-style systems, not just in base models.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 29. Image note: the slide summarizes representative training-data extraction results from GPT-2. Why it matters: whenever corpora contain rare, repetitive, or templated strings, generation models may reproduce them verbatim under the right prompting conditions.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 44. Image note: the slide shows a prompt, verification, and majority-voting workflow for extracting data from a chat system. Why it matters: modern extraction attacks are iterative pipelines, not just a few lucky prompts.
5. LLM, RAG, and embedding privacy
5.1 RAG is often the real leakage path
In RAG systems, privacy failure is often caused by:
- sensitive documents in the retrieval index
- enumeration-style prompts
- weak retrieval authorization
- logs, caches, or memory retaining sensitive context
5.2 Embedding attacks
Embedding models are not automatically safe. If the representation retains enough separable structure, attackers may:
- train probes for sensitive attributes
- link users via nearest-neighbor search
- approximately reconstruct original content with generative inversion
6. Federated and distributed leakage
Federated learning is privacy-enhancing, but not privacy-free. Once gradients or updates are visible, an attacker may attempt:
- gradient leakage
- source inference
- client-level membership inference
- reconstruction from aggregation artifacts
Keeping raw data local does not guarantee that no information leaves the device.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 74. Image note: the slide motivates distributed or federated approaches by showing how centralized training aggregates sensitive data. Why it matters: federated learning is interesting because centralized pipelines already create a large leakage surface, not because federated updates are automatically safe.
7. Differential privacy and training-time mitigation
7.1 Differential privacy
Differential privacy requires neighboring datasets \(D\) and \(D'\) to produce similar output distributions:
7.2 DP-SGD

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 71. Image note: the slide presents the standard DP-SGD loop with per-example clipping, Gaussian noise, and privacy accounting. Why it matters: privacy protection has to be integrated into optimization rather than patched on after training.
The DP-SGD recipe is:
- compute per-example gradients
- clip them
- add noise after aggregation
- track the privacy budget
Its strengths are formal guarantees; its costs include reduced utility, more expensive training, and harder budget accounting in large-model pipelines.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 67. Image note: the page groups randomized mechanisms, Laplace/Gaussian mechanisms, and example applications under one DP view. Why it matters: DP is not a single algorithm name; it is a design discipline around provable leakage bounds.
8. Deployment-time mitigation
Effective privacy protection is layered:
| Layer | Typical controls | Goal |
|---|---|---|
| Data | PII detection, minimization, de-identification | reduce what can be memorized |
| Training | DP-SGD, regularization, early stopping | reduce memorization and overfitting |
| Interface | limited outputs, rate limits, safe APIs | reduce observable attack signal |
| System | access control, redacted logs, audit | reduce side leakage paths |
| Governance | deletion flows, unlearning, compliance | respond to legal and operational needs |
8.1 Machine unlearning
The course emphasizes that unlearning is not just a theoretical problem. In production it means coordinating:
- base models
- retrieval indexes
- caches
- fine-tuning datasets
- logs and stored traces

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 82. Image note: the slide places prompt tuning, parameter-efficient unlearning, and evaluation into one pipeline. Why it matters: honoring deletion requests is not one command; it requires coordinated changes to training, indexing, and deployment artifacts.

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 85. Image note: the slide closes with privacy-policy and institutional language. Why it matters: many privacy failures come from default collection, retention, and sharing choices rather than from one isolated model bug.
Relations to other topics
- For the high-level framework, see AI Safety Overview
- For training-time shortcut attacks, see Backdoor Attacks
- For LLM deployment risks, see LLM Jailbreaking and AI Engineering Safety & Governance
- For governance and compliance, see AI Ethics & Governance
References
- Tufts EE141 Trusted AI Course Slides, Privacy Lecture, Spring 2026.
- Shokri et al., "Membership Inference Attacks Against Machine Learning Models", IEEE S&P 2017.
- Fredrikson et al., "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures", CCS 2015.
- Tramèr et al., "Stealing Machine Learning Models via Prediction APIs", USENIX Security 2016.
- Carlini et al., "Extracting Training Data from Large Language Models", USENIX Security 2021.
- Abadi et al., "Deep Learning with Differential Privacy", CCS 2016.