
Privacy Attacks and Defenses

Privacy risk in AI is not just a question of whether a model “contains personal data.” The real issue is whether outputs, gradients, embeddings, or retrieval pipelines leak information that should not be recoverable. A useful structure is to separate the main attack surfaces (membership inference, model inversion, model extraction, training-data extraction, and federated leakage) from differential privacy, the canonical training-time defense.

This page follows that structure while extending it to LLMs, RAG, and machine unlearning.

1. Attack goals and information surfaces

| Goal                 | Attacker question                        | Typical interface                  |
|----------------------|------------------------------------------|------------------------------------|
| Membership           | Was a specific record in training?       | classifier API, embedding API      |
| Sensitive attributes | Can hidden attributes be inferred?       | scores, embeddings                 |
| Sample content       | Can the original input be reconstructed? | logits, generation API, gradients  |
| Model behavior       | Can the model itself be copied?          | black-box query interfaces         |

Formally, privacy attacks exploit non-zero mutual information between what the attacker observes and the sensitive variable:

\[ I(\text{observed output}; \text{sensitive variable}) > 0 \]
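
Operationally, non-zero dependence means a probe trained on observed outputs predicts the sensitive variable better than chance. Below is a minimal sketch of that check on synthetic data; the injected correlation, sizes, and classifier choice are illustrative assumptions, not part of the course material.

```python
# Probe-based leakage check on synthetic data: if a classifier trained on
# observed outputs beats the base rate at predicting the sensitive bit,
# the mutual information above is non-zero in practice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sensitive = rng.integers(0, 2, size=2000)     # hypothetical sensitive bit
outputs = rng.normal(size=(2000, 8))          # stand-in for observed score vectors
outputs[:, 0] += 0.5 * sensitive              # weak dependence injected for the demo

X_tr, X_te, y_tr, y_te = train_test_split(outputs, sensitive, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

base_rate = max(y_te.mean(), 1 - y_te.mean())
print(f"probe accuracy {probe.score(X_te, y_te):.2f} vs base rate {base_rate:.2f}")
```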

Privacy attack taxonomy

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 4. Image note: a tree diagram classifies privacy attacks into membership inference, model inversion, model extraction, and attribute inference. Why it matters: privacy attacks are not a single problem but at least four independent attack surfaces.

2. Membership inference

2.1 Core intuition

Membership inference attack definition

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 6. Image note: the slide defines MIA as deciding whether a record belongs to the training dataset. Why it matters: membership inference is not about reconstructing full samples; it is about exploiting the model's extra familiarity with training examples.

Membership inference asks whether a sample was part of the training set. Many attacks work because models are often more confident and more stable on training data than on non-members.

Common signals, computed in the sketch after this list, include:

  • unusually high confidence
  • lower entropy
  • larger prediction margins
  • stronger stability under perturbation
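
A minimal sketch of these signals from softmax outputs, assuming only query access to the model's logits; the placeholder tensors and the 0.9 cutoff are illustrative.

```python
# Per-example membership signals computed from softmax outputs.
# `logits` is a random placeholder; in an attack it would come from
# querying the target model on the candidate sample.
import torch
import torch.nn.functional as F

logits = torch.randn(16, 10)                    # placeholder model outputs
probs = F.softmax(logits, dim=-1)

confidence = probs.max(dim=-1).values           # tends to be higher for members
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # lower for members
top2 = probs.topk(2, dim=-1).values
margin = top2[:, 0] - top2[:, 1]                # larger for members

# The simplest attack thresholds one signal; the cutoff is illustrative.
is_member_guess = confidence > 0.9
```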

2.2 Shadow training attacks

Shadow training attack

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 11. Image note: the slide shows how an attacker trains multiple shadow models on similar-distribution data and then learns an attack classifier from their in/out behavior. Why it matters: black-box access can still be enough to recover membership signals.

Shadow training typically works by (see the code sketch after the list):

  1. collecting data similar to the target distribution
  2. training several shadow models
  3. recording member vs non-member outputs
  4. training an attack classifier on those outputs
  5. transferring the attack to the target model
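
A minimal sketch of that recipe on synthetic data; the shadow architecture, the number of shadow models, and the sorted-probability features are illustrative choices rather than the only ones that work.

```python
# Shadow-training sketch: several shadow models are trained on data from a
# similar distribution, their member/non-member output vectors are recorded,
# and an attack classifier is fit on those records.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_data(n=600, d=20):
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

attack_X, attack_y = [], []
for _ in range(5):                                   # several shadow models
    X, y = make_data()
    X_in, y_in, X_out, y_out = X[:300], y[:300], X[300:], y[300:]
    shadow = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_in, y_in)
    for feats, label in [(shadow.predict_proba(X_in), 1),    # members
                         (shadow.predict_proba(X_out), 0)]:  # non-members
        attack_X.append(np.sort(feats, axis=1))              # order-invariant features
        attack_y.append(np.full(len(feats), label))

attack_model = LogisticRegression(max_iter=1000).fit(
    np.concatenate(attack_X), np.concatenate(attack_y))
# At attack time, query the target model and feed its sorted probability
# vector to `attack_model` to predict membership.
```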

Confidence-based membership inference

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 16. Image note: a flowchart shows how the model's confidence score on an input is used to decide whether that sample belongs to the training set. Why it matters: models are typically more confident on training samples — this phenomenon is the core premise that makes membership inference attacks work.

3. Attribute inference and model inversion

Attribute inference assumes the attacker knows most of a sample and wants to recover missing sensitive attributes (age, gender, health status, demographic membership) from scores or embeddings.

Model inversion goes further and tries to reconstruct an input prototype by maximizing the target class score \(f_\theta(x)_y\) under a regularizer \(R\):

\[ x^{*} = \arg\max_x \; f_\theta(x)_y - \lambda R(x) \]

In vision this may reconstruct class prototypes or average faces; in language models it may expose memorized strings or template-like content.
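
A minimal sketch of this optimization with an L2 regularizer, assuming white-box gradient access to a placeholder classifier; a black-box attacker would need gradient estimation instead, and the model, target class, and step counts here are illustrative.

```python
# Model-inversion sketch: gradient ascent on the input to maximize the score
# of a target class, minus lambda * R(x) with R(x) = ||x||^2.
import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
target_class, lam = 3, 0.01

x = torch.zeros(1, 1, 28, 28, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    score = model(x)[0, target_class]
    loss = -score + lam * x.pow(2).sum()   # maximize f(x)_y - lambda * R(x)
    loss.backward()
    opt.step()

prototype = x.detach()                      # reconstructed class prototype
```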

Model inversion attack intuition

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 19. Image note: the slide places reconstructed images beside the target identity or class. Why it matters: inversion is about recovering representative features or prototypes, which still reveals training information even when full records are not perfectly recovered.

4. Model extraction and training-data extraction

Model extraction attacks try to clone a service model through black-box queries.

MLaaS as an attack surface

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 21. Image note: the slide frames Machine Learning as a Service as a realistic black-box attack surface. Why it matters: once models are productized behind APIs, query access simultaneously enables theft, evasion, and privacy probing.

Core idea of model extraction

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 22. Image note: the figure connects the target model, the query process, and the substitute model. Why it matters: extraction does not need to recover original weights; matching the decision surface is often enough to create a high-value surrogate.

Model extraction motivation

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 23. Image note: three columns compare the consequences of model extraction: intellectual property loss, training data exposure, and expanded attack surface. Why it matters: model extraction is not just an economic problem — a stolen model can also be used to generate adversarial examples.

Common motivations include (a surrogate-training sketch follows the list):

  • avoiding API cost
  • improving black-box adversarial attacks with a surrogate model
  • enabling downstream privacy or jailbreak analysis
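
A minimal sketch of query-based extraction as soft-label distillation; the stand-in target model, the query distribution, and the query budget are illustrative, and a real attacker would only see API responses rather than a local target.

```python
# Extraction sketch: query a black-box target on attacker-chosen inputs and
# fit a surrogate to its soft outputs (distillation on the decision surface).
import torch
import torch.nn.functional as F

target = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 5))        # hidden from the attacker
surrogate = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 5))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for _ in range(500):
    queries = torch.randn(128, 20)                    # attacker-chosen queries
    with torch.no_grad():
        soft_labels = F.softmax(target(queries), -1)  # observed API outputs
    loss = F.kl_div(F.log_softmax(surrogate(queries), -1), soft_labels,
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```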

Training-data extraction focuses on whether memorized text can be reproduced, especially:

  • rare strings
  • PII-like formats
  • repeated web-scale templates
  • high-entropy secrets

The course also highlights extraction risk in RAG and chat-style systems, not just in base models.

GPT-2 training-data extraction

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 29. Image note: the slide summarizes representative training-data extraction results from GPT-2. Why it matters: whenever corpora contain rare, repetitive, or templated strings, generation models may reproduce them verbatim under the right prompting conditions.

ChatGPT-style data extraction workflow

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 44. Image note: the slide shows a prompt, verification, and majority-voting workflow for extracting data from a chat system. Why it matters: modern extraction attacks are iterative pipelines, not just a few lucky prompts.
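
A minimal generate-then-rank sketch against the public GPT-2 checkpoint via the Hugging Face transformers library. The prompt, sampling parameters, and the use of the model's own average negative log-likelihood as the ranking signal are simplifications; the published attacks combine several signals (for example, perplexity ratios against a reference model) and manual verification.

```python
# Generate many continuations, then rank candidates by how confidently the
# target model reproduces them; high-confidence candidates are verification
# targets, not automatically memorized data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()     # mean NLL per token

prompt_ids = tok("Contact me at", return_tensors="pt").input_ids  # illustrative prompt
samples = model.generate(prompt_ids, do_sample=True, top_k=40,
                         max_new_tokens=48, num_return_sequences=8,
                         pad_token_id=tok.eos_token_id)

candidates = [tok.decode(s, skip_special_tokens=True) for s in samples]
candidates.sort(key=avg_nll)      # most "confident" generations first
# Top candidates still need verification (corpus lookup, web search) before
# being treated as memorized training data.
```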

5. LLM, RAG, and embedding privacy

5.1 RAG is often the real leakage path

In RAG systems, privacy failure is often caused by the following (retrieval-layer controls are sketched after the list):

  • sensitive documents in the retrieval index
  • enumeration-style prompts
  • weak retrieval authorization
  • logs, caches, or memory retaining sensitive context
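
These failure modes point at retrieval-layer controls rather than model changes. Below is a minimal sketch of per-user filtering and redaction before prompt construction; the Chunk structure, the ACL field, and the crude email regex are hypothetical, and real systems would enforce authorization inside the retrieval service itself.

```python
# Filter retrieved chunks by an access-control list and redact obvious PII
# before anything reaches the prompt.
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    acl: set          # principals allowed to read this chunk (hypothetical)

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def build_context(retrieved: list, user: str, k: int = 4) -> str:
    allowed = [c for c in retrieved if user in c.acl]          # authorization check
    redacted = [EMAIL.sub("[REDACTED_EMAIL]", c.text) for c in allowed[:k]]
    return "\n\n".join(redacted)

# Example: a chunk the user is not entitled to never reaches the prompt.
docs = [Chunk("HR note: alice@example.com salary review", {"hr_team"}),
        Chunk("Public FAQ: how to reset a password", {"hr_team", "bob"})]
print(build_context(docs, user="bob"))
```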

5.2 Embedding attacks

Embedding models are not automatically safe. If the representation retains enough separable structure, attackers may (see the linkage sketch after the list):

  • train probes for sensitive attributes
  • link users via nearest-neighbor search
  • approximately reconstruct original content with generative inversion
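
The linkage risk in particular needs no trained probe at all. A minimal sketch of re-identification by cosine nearest neighbors across two embedding releases of the same users, on synthetic vectors standing in for exported embedding-API outputs:

```python
# Link records across two "anonymized" releases by nearest-neighbor search
# over their embeddings; overlapping users are re-identified by cosine match.
import numpy as np

rng = np.random.default_rng(0)
users = rng.normal(size=(500, 64))                        # latent user representations
release_a = users + 0.1 * rng.normal(size=users.shape)    # two services' embeddings
release_b = users + 0.1 * rng.normal(size=users.shape)    # of the same users

a = release_a / np.linalg.norm(release_a, axis=1, keepdims=True)
b = release_b / np.linalg.norm(release_b, axis=1, keepdims=True)
match = (a @ b.T).argmax(axis=1)                          # cosine nearest neighbor in B
print("re-identification rate:", (match == np.arange(len(users))).mean())
```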

6. Federated and distributed leakage

Federated learning is privacy-enhancing, but not privacy-free. Once gradients or updates are visible, an attacker may attempt:

  • gradient leakage
  • source inference
  • client-level membership inference
  • reconstruction from aggregation artifacts

Keeping raw data local does not guarantee that no information leaves the device.
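
A minimal sketch of a gradient-matching reconstruction attack: optimize a dummy example so that its gradient matches the observed client update. The toy linear model, the soft-label parameterization, and the LBFGS settings are illustrative; real attacks operate on much larger updates and batches.

```python
# Gradient-leakage sketch: given one example's gradient as seen by the server,
# recover the example by matching gradients with a dummy input and label.
import torch

model = torch.nn.Linear(16, 4)
loss_fn = torch.nn.CrossEntropyLoss()

# The "observed" update: gradient of one private example.
x_true, y_true = torch.randn(1, 16), torch.tensor([2])
true_grads = torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())

x_dummy = torch.randn(1, 16, requires_grad=True)
y_dummy = torch.randn(1, 4, requires_grad=True)           # soft label to optimize
opt = torch.optim.LBFGS([x_dummy, y_dummy])

def closure():
    opt.zero_grad()
    dummy_loss = torch.sum(-torch.softmax(y_dummy, -1)
                           * torch.log_softmax(model(x_dummy), -1))
    grads = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
    diff = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
    diff.backward()
    return diff

for _ in range(50):
    opt.step(closure)
# x_dummy now approximates the private example x_true.
```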

Drawbacks of centralized ML

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 74. Image note: the slide motivates distributed or federated approaches by showing how centralized training aggregates sensitive data. Why it matters: federated learning is interesting because centralized pipelines already create a large leakage surface, not because federated updates are automatically safe.

7. Differential privacy and training-time mitigation

7.1 Differential privacy

Differential privacy requires a randomized mechanism \(\mathcal{M}\) to produce similar output distributions on any neighboring datasets \(D\) and \(D'\) (differing in a single record); for every set of outputs \(S\):

\[ \Pr[\mathcal{M}(D) \in S] \le e^\epsilon \Pr[\mathcal{M}(D') \in S] + \delta \]
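
A minimal sketch of the definition in action: a counting query has sensitivity 1 (adding or removing one record changes the count by at most 1), so the Laplace mechanism with scale \(1/\epsilon\) satisfies \((\epsilon, 0)\)-DP. The records and the predicate are synthetic placeholders.

```python
# Laplace mechanism on a counting query with sensitivity 1.
import numpy as np

rng = np.random.default_rng(0)

def private_count(records, predicate, epsilon):
    true_count = sum(predicate(r) for r in records)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

records = [{"age": a} for a in rng.integers(18, 90, size=1000)]
print(private_count(records, lambda r: r["age"] > 65, epsilon=0.5))
```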

7.2 DP-SGD

DP-SGD algorithm

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 71. Image note: the slide presents the standard DP-SGD loop with per-example clipping, Gaussian noise, and privacy accounting. Why it matters: privacy protection has to be integrated into optimization rather than patched on after training.

The DP-SGD recipe is:

  1. compute per-example gradients
  2. clip them
  3. add noise after aggregation
  4. track the privacy budget

Its strengths are formal guarantees; its costs include reduced utility, more expensive training, and harder budget accounting in large-model pipelines.
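
A minimal sketch of that recipe written as an explicit micro-batch loop for clarity. Production systems use vectorized per-example gradients (for example, the Opacus library) and a proper privacy accountant; the model, clipping norm, and noise multiplier here are placeholders.

```python
# DP-SGD sketch: per-example gradients are clipped, summed, noised once after
# aggregation, and then applied as an averaged update.
import torch

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
clip_norm, noise_mult, lr = 1.0, 1.1, 0.1

X, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

summed = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(X)):                                   # per-example gradients
    model.zero_grad()
    loss_fn(model(X[i:i + 1]), y[i:i + 1]).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = (clip_norm / (norm + 1e-12)).clamp(max=1.0)   # clip to clip_norm
    for s, g in zip(summed, grads):
        s += g * scale

with torch.no_grad():
    for p, s in zip(model.parameters(), summed):
        noise = noise_mult * clip_norm * torch.randn_like(s)   # noise added once,
        p -= lr * (s + noise) / len(X)                         # after aggregation
```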

Differential privacy as a design family

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 67. Image note: the page groups randomized mechanisms, Laplace/Gaussian mechanisms, and example applications under one DP view. Why it matters: DP is not a single algorithm name; it is a design discipline around provable leakage bounds.

8. Deployment-time mitigation

Effective privacy protection is layered:

| Layer      | Typical controls                               | Goal                                    |
|------------|------------------------------------------------|-----------------------------------------|
| Data       | PII detection, minimization, de-identification | reduce what can be memorized            |
| Training   | DP-SGD, regularization, early stopping         | reduce memorization and overfitting     |
| Interface  | limited outputs, rate limits, safe APIs        | reduce observable attack signal         |
| System     | access control, redacted logs, audit           | reduce side leakage paths               |
| Governance | deletion flows, unlearning, compliance         | respond to legal and operational needs  |

8.1 Machine unlearning

The course emphasizes that unlearning is not just a theoretical problem. In production it means coordinating:

  • base models
  • retrieval indexes
  • caches
  • fine-tuning datasets
  • logs and stored traces

LLM unlearning as a system problem

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 82. Image note: the slide places prompt tuning, parameter-efficient unlearning, and evaluation into one pipeline. Why it matters: honoring deletion requests is not one command; it requires coordinated changes to training, indexing, and deployment artifacts.

Privacy policy and institutional controls

Source: Tufts EE141 Trusted AI, Lecture 5, Slide 85. Image note: the slide closes with privacy-policy and institutional language. Why it matters: many privacy failures come from default collection, retention, and sharing choices rather than from one isolated model bug.

Relations to other topics

References

  • Tufts EE141 Trusted AI Course Slides, Privacy Lecture, Spring 2026.
  • Shokri et al., "Membership Inference Attacks Against Machine Learning Models", IEEE S&P 2017.
  • Fredrikson et al., "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures", CCS 2015.
  • Tramèr et al., "Stealing Machine Learning Models via Prediction APIs", USENIX Security 2016.
  • Carlini et al., "Extracting Training Data from Large Language Models", USENIX Security 2021.
  • Abadi et al., "Deep Learning with Differential Privacy", CCS 2016.
