Authenticity and Privacy Protection
With the rapid advancement of generative AI capabilities, content authenticity and privacy protection have become two unavoidable topics in trustworthy AI. Deepfakes can fabricate anyone's face and voice, and large model training data may contain users' private information. This article covers the core techniques and countermeasures in both of these areas.
Deepfakes and Detection
Generation Techniques
Deepfake is a broad term for forged media content (images, video, audio) generated using deep learning. The term originated from a Reddit user's handle in 2017 and has since become the general term for AI-generated fake content. The main technical approaches include:
- Face Swap: Transfers the facial features of a source person onto a target video. A representative tool is DeepFaceLab, whose core architecture is an autoencoder-based encoder-decoder framework: the source face and target face share an encoder but use separate decoders, enabling facial feature exchange. During training, the two decoders learn to reconstruct their respective faces; during inference, the source face's encoding is fed into the target face's decoder to produce the swapped face (a code sketch of this shared-encoder design follows this list).
- Face Reenactment: Preserves the target person's appearance while transferring the source person's expressions, head pose, and other movements. A representative method is the First Order Motion Model, which uses keypoint detection and optical flow estimation to achieve motion transfer without requiring subject-specific training. This means an attacker can manipulate someone's "digital avatar" in real time.
- Voice Cloning: Synthesizes arbitrary speech content with the timbre and prosodic characteristics of a target speaker, based on only a small sample of their voice. Early methods (e.g., SV2TTS) required seconds to minutes of reference audio; modern methods (e.g., VALL-E) can achieve high-quality cloning from as little as 3 seconds of audio, enabling social engineering attacks such as phone scams.
- Diffusion-Based Generation: The advent of diffusion models has dramatically improved the quality and diversity of forged content. Through text-to-image (e.g., Stable Diffusion, DALL-E 3) or image-to-image approaches, extremely realistic human images can be generated, no longer limited to face-swapping scenarios. Combined with conditional control techniques like ControlNet, attackers can precisely control the pose, expression, and background of generated images, making forged content even harder to distinguish.
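The shared-encoder, dual-decoder design behind face swapping can be illustrated with a minimal PyTorch sketch. The layer sizes, module names, and interface below are illustrative assumptions, not DeepFaceLab's actual implementation:

```python
import torch.nn as nn

class FaceSwapAutoencoder(nn.Module):
    """Toy shared-encoder / dual-decoder face-swap model (illustrative only)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        # One encoder shared by both identities: learns identity-agnostic structure
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Separate decoders: each learns to render one specific identity
        self.decoder_src = self._make_decoder(latent_dim)
        self.decoder_tgt = self._make_decoder(latent_dim)

    @staticmethod
    def _make_decoder(latent_dim):
        return nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, identity):
        z = self.encoder(x)
        # Training: reconstruct each face with its own decoder.
        # Inference (swap): feed one identity's encoding into the other identity's decoder.
        return self.decoder_src(z) if identity == "src" else self.decoder_tgt(z)
```

Intuitively, the shared encoder captures pose, expression, and lighting common to both identities, while each decoder learns identity-specific appearance, which is why routing an encoding through the other decoder produces a swapped face.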
Detection Methods
Deepfake detection is an ongoing "arms race" — as generation techniques continue to advance, detection methods must evolve accordingly.
Frequency Analysis
GAN-generated images typically exhibit characteristic spectral artifacts in the frequency domain, such as periodic high-frequency patterns. These artifacts originate from GANs' upsampling operations (e.g., the "checkerboard effect" in transposed convolutions). By applying FFT (Fast Fourier Transform) or DCT analysis to an image, these anomalies — invisible to the human eye — can be revealed.
Specifically, after transforming an image into the frequency domain, real images typically exhibit a smooth decay pattern in the spectrum, whereas GAN-generated images show anomalous peaks or periodic patterns. However, images generated by diffusion models display significantly different artifact patterns in the frequency domain compared to GANs, requiring specifically designed detection features.
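A minimal sketch of such a frequency-domain check, assuming NumPy and a grayscale image array; the feature (fraction of spectral energy at high radial frequencies) and the decision threshold are illustrative choices, not a production detector:

```python
import numpy as np

def high_freq_energy_ratio(gray_image: np.ndarray, cutoff: float = 0.75) -> float:
    """Fraction of spectral energy beyond a radial frequency cutoff.

    GAN upsampling artifacts often inflate this ratio relative to natural images.
    """
    # 2D FFT, shift the zero frequency to the center
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray_image)))
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    # Radial distance of every frequency bin from the center, normalized to [0, 1]
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    radius /= radius.max()
    energy = spectrum ** 2
    return energy[radius > cutoff].sum() / energy.sum()

# Usage sketch: flag images whose high-frequency energy is anomalously large.
# score = high_freq_energy_ratio(np.asarray(img.convert("L"), dtype=np.float64))
# is_suspicious = score > 0.05  # threshold would be calibrated on real/fake data
```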
Biological Signal Detection
Real video of human faces contains subtle physiological signals, such as skin color micro-variations caused by heartbeat (rPPG, remote Photoplethysmography), natural blink frequency and patterns, and pupillary light reflexes. Deepfake videos typically cannot perfectly replicate these signals.
For example, people in early deepfake videos almost never blinked, which served as a simple detection cue (though newer methods have addressed this flaw). More advanced approaches analyze the spatiotemporal consistency of facial blood flow signals — in real video, the rPPG signals from different facial regions should remain temporally synchronized, whereas this synchrony is often disrupted in deepfake videos.
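The rPPG-consistency idea can be sketched as follows: extract an intensity time series from several facial regions and measure their pairwise correlation. The choice of regions, channel, and threshold are illustrative assumptions:

```python
import numpy as np

def rppg_synchrony(region_signals: np.ndarray) -> float:
    """Mean pairwise correlation between per-region intensity time series.

    region_signals: shape (num_regions, num_frames), e.g. the mean green-channel
    value of the forehead and both cheeks in every frame. Signals driven by the
    same heartbeat should be highly correlated; deepfakes often break this.
    """
    corr = np.corrcoef(region_signals)  # normalizes each region's signal internally
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(off_diag.mean())

# Usage sketch: low cross-region synchrony is a cue that the video may be fake.
# score = rppg_synchrony(signals)   # signals: (3 regions, 300 frames)
# is_suspicious = score < 0.5       # threshold calibrated on real videos
```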
Deepfake Detection Models
From a machine learning perspective, deepfake detection is essentially a binary classification problem — determining whether the input is real or forged. The typical pipeline is:
- Preprocessing: Crop the face region from video frames (using face detectors such as MTCNN or RetinaFace)
- Feature Extraction: Use a pretrained image classification network (e.g., EfficientNet, XceptionNet) as the backbone
- Fine-tuning: Perform binary classification fine-tuning on deepfake datasets
- Post-processing: Aggregate detection results across multiple frames to obtain a video-level decision
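A minimal sketch of this pipeline using PyTorch and the timm library; the backbone choice, hyperparameters, and the assumption that face crops have already been extracted are illustrative:

```python
import torch
import torch.nn as nn
import timm

# Feature extraction: pretrained backbone with a 2-class head (real vs. fake)
model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(face_crops: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of pre-cropped face images (B, 3, H, W)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(face_crops), labels)  # labels: 0 = real, 1 = fake
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def video_decision(frame_crops: torch.Tensor, threshold: float = 0.5) -> bool:
    """Post-processing: average per-frame fake probabilities into a video-level verdict."""
    model.eval()
    probs = model(frame_crops).softmax(dim=1)[:, 1]  # P(fake) for each frame
    return probs.mean().item() > threshold
```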
Training data comes from established deepfake datasets such as FaceForensics++ (containing 5 forgery methods), Celeb-DF (high-quality face-swap videos), and DFDC (a large-scale competition dataset initiated by Facebook). Some methods also incorporate temporal information (leveraging inter-frame inconsistencies) or multimodal information (simultaneously analyzing audio-video synchronization, checking whether lip movements match speech).
Core Challenge: Cross-Generator Generalization
Cross-generator generalization is the most challenging problem in deepfake detection. A detector trained on one generation method (e.g., FaceSwap) often suffers severe performance degradation when confronted with another method (e.g., diffusion-based). This is essentially a domain generalization problem.
Potential solutions include:
- Learning universal forgery features (e.g., unnatural texture details, blending traces at facial boundaries, inconsistent lighting directions) rather than artifacts specific to a particular generator
- Using contrastive learning to help models learn the essential distinction between "real vs. forged" rather than memorizing features of specific forgery methods
- Conducting multi-source training across multiple generation methods to enhance generalization
- Applying data augmentation to simulate various post-processing operations (compression, blurring, cropping), improving model robustness to unseen transformations
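The last point, simulating post-processing during training, can be sketched with torchvision-style transforms; the specific operations and parameter ranges below are illustrative assumptions:

```python
import io
import random
from PIL import Image
from torchvision import transforms

def random_jpeg_compress(img: Image.Image, quality_range=(30, 90)) -> Image.Image:
    """Re-encode the image as JPEG at a random quality to simulate re-sharing artifacts."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(*quality_range))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Augmentation pipeline: random crops, blur, and compression push the detector
# toward forgery cues that survive common post-processing operations.
robust_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5)], p=0.3),
    transforms.Lambda(random_jpeg_compress),
    transforms.ToTensor(),
])
```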
Digital Watermarking
Digital watermarking embeds imperceptible markers into content to enable copyright protection and content provenance. In an era of proliferating AI-generated content (AIGC), watermarking has become a critical technical tool for distinguishing real from synthetic content. A good watermarking system must satisfy three fundamental requirements: imperceptibility (no degradation of content quality), robustness (detectable even after compression, cropping, and other operations), and security (difficult to maliciously remove or forge).
Traditional Watermarking
- Spatial Domain Watermarking: Directly modifies pixel values to embed watermark information, for example by modifying the Least Significant Bit (LSB). This approach is straightforward but has poor robustness, being easily destroyed by image compression, cropping, noise addition, and other operations. It is suitable for scenarios with low robustness requirements, such as digital copyright declarations.
- Frequency Domain Watermarking: Embeds the watermark after transforming the image into the frequency domain, leveraging the redundancy of frequency coefficients to hide information. Common transforms include:
  - DCT (Discrete Cosine Transform): The foundational transform of JPEG compression. Watermarks are embedded in mid-frequency DCT coefficients: low-frequency coefficients carry the main visual information and should not be modified, high-frequency coefficients are easily discarded by compression, and mid-frequency coefficients strike a balance between the two, providing good robustness against JPEG compression.
  - DWT (Discrete Wavelet Transform): Offers multi-resolution analysis, decomposing images into subbands of different frequencies and orientations. Watermarks can be embedded in wavelet coefficients at different scales, balancing imperceptibility and robustness. DWT watermarks generally outperform DCT watermarks in resisting geometric attacks (e.g., rotation, scaling).
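A minimal sketch of the LSB approach from the spatial-domain bullet above, assuming an 8-bit grayscale image as a NumPy array; a real system would add synchronization and error correction:

```python
import numpy as np

def embed_lsb(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Write watermark bits into the least significant bit of the first len(bits) pixels."""
    flat = image.astype(np.uint8).flatten()
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits.astype(np.uint8)
    return flat.reshape(image.shape)

def extract_lsb(image: np.ndarray, num_bits: int) -> np.ndarray:
    """Read the watermark back from the least significant bits."""
    return image.flatten()[:num_bits] & 1

# Usage sketch: a 64-bit payload survives lossless storage but not JPEG re-encoding,
# which is exactly the robustness limitation described above.
# payload = np.random.randint(0, 2, 64)
# marked = embed_lsb(cover_image, payload)
# assert np.array_equal(extract_lsb(marked, 64), payload)
```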
AIGC Watermarking
With the explosion of AIGC, watermarking techniques specifically designed for AI-generated content have become a research hotspot. Unlike traditional watermarking, AIGC watermarks can be embedded during the generation process rather than added after the fact.
Diffusion Model Output Watermarking
Watermarks are embedded during the generation process of diffusion models. For example, the Tree-Ring Watermark method embeds a concentric ring pattern (tree-ring pattern) in the Fourier space of the initial noise, which is preserved throughout the entire denoising process and ultimately appears in the generated image. During detection, DDIM inversion is applied to the image to recover the initial noise, and the frequency domain is then examined for the presence of the ring pattern.
The advantages of this approach are:
- No impact on generation quality — the watermark is embedded in the semantically irrelevant noise space
- Robust against image editing operations — the ring pattern remains detectable even after cropping, compression, noise addition, and other operations
- No model modification required — only the initial noise needs to be modified
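The core idea, a ring-shaped key in the Fourier domain of the initial noise, can be sketched with NumPy. This is a simplified illustration rather than the published method: the ring radii, key value, and detection statistic are assumptions, the noise is treated as a single 2D array, and the DDIM inversion needed for real detection is omitted.

```python
import numpy as np

def embed_ring_key(noise: np.ndarray, radii=(10, 14, 18), key_value=0.0) -> np.ndarray:
    """Overwrite thin concentric rings in the Fourier spectrum of the initial noise."""
    freq = np.fft.fftshift(np.fft.fft2(noise))
    h, w = noise.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    for r in radii:
        freq[np.abs(radius - r) < 1.0] = key_value  # the "tree rings"
    # Back to the spatial domain; this watermarked noise seeds the diffusion sampler
    return np.real(np.fft.ifft2(np.fft.ifftshift(freq)))

def ring_key_score(recovered_noise: np.ndarray, radii=(10, 14, 18)) -> float:
    """Mean spectral magnitude on the ring locations of noise recovered by inversion.

    With key_value = 0, the score is near zero for watermarked images and much
    larger for unwatermarked ones.
    """
    freq = np.fft.fftshift(np.fft.fft2(recovered_noise))
    h, w = recovered_noise.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = np.zeros_like(recovered_noise, dtype=bool)
    for r in radii:
        mask |= np.abs(radius - r) < 1.0
    return float(np.abs(freq[mask]).mean())
```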
Text Watermarking
The KGW scheme (Kirchenbauer et al., 2023) is a landmark work in LLM text watermarking. The core idea is:
- At each token generation step, use the previous token as a seed to partition the vocabulary into a "Green List" and a "Red List" via a hash function
- During sampling, apply a positive bias \(\delta\) to tokens in the green list (increasing their logit values), making them more likely to be selected
- During detection, count the proportion of green tokens in the text — watermarked text will contain an anomalously high proportion of green tokens, which can be determined through a z-test
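A minimal sketch of the green-list bias and z-test detection, assuming a generic logits array and integer token ids; the hash construction and the parameter values (\(\gamma\), \(\delta\)) are illustrative, not the authors' exact implementation:

```python
import hashlib
import numpy as np

def green_list(prev_token_id: int, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Deterministically derive the green list from the previous token id."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.permutation(vocab_size)[: int(gamma * vocab_size)]

def watermarked_logits(logits: np.ndarray, prev_token_id: int, delta: float = 2.0) -> np.ndarray:
    """Add a positive bias delta to green-list tokens before sampling."""
    biased = logits.copy()
    biased[green_list(prev_token_id, logits.shape[-1])] += delta
    return biased

def detect_z_score(token_ids: list, vocab_size: int, gamma: float = 0.5) -> float:
    """z-test: is the observed fraction of green tokens higher than the expected gamma?"""
    hits = sum(
        tok in set(green_list(prev, vocab_size, gamma))
        for prev, tok in zip(token_ids, token_ids[1:])
    )
    n = len(token_ids) - 1
    return (hits - gamma * n) / np.sqrt(gamma * (1 - gamma) * n)

# Usage sketch: a z-score above roughly 4 is strong evidence the text is watermarked.
```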
This scheme achieves reliable watermark detection with virtually no impact on text quality. Its main limitations are: (1) limited robustness against paraphrasing; (2) low-entropy tokens (e.g., "the", "of", and other nearly deterministic words) cannot effectively carry watermark information; (3) the watermark's presence could be exploited by malicious third parties to audit whether specific text was generated by a particular LLM.
C2PA Standard (Content Provenance)
C2PA (Coalition for Content Provenance and Authenticity) is not a watermarking algorithm but rather a content provenance standard. It embeds cryptographically signed provenance information (such as creator, creation tool, editing history, etc.) into content metadata, enabling users to verify the complete provenance chain of the content.
C2PA is jointly promoted by Adobe, Microsoft, Intel, and other companies, with the goal of establishing a "nutrition label for content" — it does not judge the authenticity of content but provides transparent provenance information, allowing users to make their own judgments. C2PA uses Public Key Infrastructure (PKI) for signature verification, ensuring that provenance information cannot be forged. Its limitations include: (1) it requires widespread adoption across the entire ecosystem (camera manufacturers, software platforms, social media) to be effective; (2) it cannot cover generation tools that do not voluntarily participate in the C2PA ecosystem.
Privacy Protection
AI systems inevitably encounter large volumes of data during training and deployment, which may contain personally sensitive information. Balancing data utility with privacy protection is a core challenge of trustworthy AI. Research has shown that large language models can "memorize" and verbatim reproduce personal information from training data (such as phone numbers and email addresses), making privacy protection particularly urgent.
Data Anonymization
Data anonymization transforms raw data so that individuals cannot be re-identified. Classical methods form a progressively stronger defense framework:
- K-anonymity: Ensures that each record in the dataset is indistinguishable from at least \(k-1\) other records on quasi-identifiers (such as age, zip code, and gender). For example, \(k=5\) means that every record has at least 4 "equivalent" records, preventing an attacker from pinpointing a specific individual. Implementation techniques include generalization (e.g., replacing exact age with age ranges) and suppression (removing certain records).
- L-diversity: Building on K-anonymity, it further requires that the sensitive attribute within each equivalence class has at least \(l\) distinct values. This prevents homogeneity attacks — even if an equivalence class contains 5 records satisfying K-anonymity, if they all share the same disease type, an attacker can still infer the disease of everyone in that class.
- T-closeness: Requires that the distribution of the sensitive attribute within each equivalence class differs from the global distribution by no more than a threshold \(t\) (typically measured using Earth Mover's Distance). This further prevents skewness attacks — even if L-diversity is satisfied, if 90% of the people in an equivalence class have a particular disease, an attacker can still infer the target individual's health status with high probability.
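A minimal sketch of checking these properties with pandas, using hypothetical column names for the quasi-identifiers and the sensitive attribute:

```python
import pandas as pd

QUASI_IDENTIFIERS = ["age_range", "zip_prefix", "gender"]  # hypothetical columns
SENSITIVE = "diagnosis"                                    # hypothetical column

def check_k_anonymity(df: pd.DataFrame, k: int) -> bool:
    """Every combination of quasi-identifier values must appear at least k times."""
    return df.groupby(QUASI_IDENTIFIERS).size().min() >= k

def check_l_diversity(df: pd.DataFrame, l: int) -> bool:
    """Every equivalence class must contain at least l distinct sensitive values."""
    return df.groupby(QUASI_IDENTIFIERS)[SENSITIVE].nunique().min() >= l

# Usage sketch:
# df["age_range"] = pd.cut(df["age"], bins=[0, 30, 50, 120])  # generalization
# assert check_k_anonymity(df, k=5) and check_l_diversity(df, l=3)
```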
Differential Privacy
Differential privacy is currently the gold standard for privacy protection, proposed by Dwork et al. in 2006, providing rigorous mathematical guarantees.
Definition
A randomized algorithm \(\mathcal{M}\) satisfies \(\varepsilon\)-differential privacy if and only if for any two neighboring datasets \(D\) and \(D'\) that differ in at most one record, and for any output set \(S\):
\(\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]\)
\(\varepsilon\) is called the privacy budget — the smaller its value, the stronger the privacy protection. Intuitively, differential privacy guarantees that regardless of whether a particular individual participates in the dataset, the output distribution of the algorithm remains nearly unchanged — an attacker cannot determine whether an individual's data is in the dataset by observing the output.
In practice, the relaxed version \((\varepsilon, \delta)\)-differential privacy is more commonly used, allowing the strict \(\varepsilon\) guarantee to be violated with a very small probability \(\delta\):
\(\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta\)
Noise Mechanisms
- Laplace Mechanism: Adds noise drawn from a Laplace distribution \(\text{Lap}(\Delta f / \varepsilon)\) to numerical query results, where \(\Delta f\) is the query's global (\(\ell_1\)) sensitivity: the higher the sensitivity, the more noise is required. This satisfies \(\varepsilon\)-differential privacy.
- Gaussian Mechanism: Adds Gaussian noise \(\mathcal{N}(0, \sigma^2)\) calibrated to the \(\ell_2\) sensitivity, with \(\sigma \geq \Delta f \cdot \sqrt{2\ln(1.25/\delta)} / \varepsilon\). This satisfies \((\varepsilon, \delta)\)-differential privacy and is generally more practical in high-dimensional settings, since the \(\ell_2\) sensitivity is typically much smaller than the \(\ell_1\) sensitivity and Gaussian noise has lighter tails.
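A worked sketch of the Laplace mechanism for a counting query, whose global sensitivity is 1 because adding or removing one record changes a count by at most 1; the dataset and epsilon value are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def private_count(values: np.ndarray, predicate, epsilon: float) -> float:
    """Counting query with Laplace noise; a count has global sensitivity 1."""
    true_count = int(predicate(values).sum())
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Usage sketch: "how many patients are older than 60?" under epsilon = 0.5.
# ages = np.array([34, 71, 65, 52, 80, 45])
# answer = private_count(ages, lambda a: a > 60, epsilon=0.5)
# Smaller epsilon -> larger noise scale (1/epsilon) -> stronger privacy, less accuracy.
```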
DP-SGD
Differentially Private Stochastic Gradient Descent (DP-SGD) (Abadi et al., 2016) introduces differential privacy into the deep learning training process, ensuring that the trained model satisfies differential privacy guarantees:
- Per-sample Gradient Computation: Computes the gradient for each training sample individually, rather than computing the batch gradient directly
- Gradient Clipping: Clips each sample's gradient to a fixed norm \(C\), i.e., \(\bar{g}_i = g_i / \max(1, \|g_i\|_2 / C)\), limiting the maximum influence of any single sample on the model update
- Noise Addition: Adds calibrated Gaussian noise to the aggregated gradient: \(\tilde{g} = \frac{1}{B}(\sum_i \bar{g}_i + \mathcal{N}(0, \sigma^2 C^2 I))\)
- Privacy Accounting: Uses the Moments Accountant or Renyi Differential Privacy to track cumulative privacy loss, ensuring the entire training process stays within the total privacy budget
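A simplified sketch of one DP-SGD step in PyTorch. It computes per-sample gradients with a naive loop for clarity (libraries such as Opacus do this efficiently), and the clipping norm and noise multiplier are illustrative values:

```python
import torch

def dp_sgd_step(model, loss_fn, xb, yb, optimizer, clip_norm=1.0, noise_multiplier=1.0):
    """One differentially private SGD step: per-sample clipping plus Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Steps 1-2: per-sample gradients, each clipped to L2 norm <= clip_norm
    for x, y in zip(xb, yb):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = 1.0 / max(1.0, (total_norm / clip_norm).item())
        for s, g in zip(summed, grads):
            s += g * scale

    # Step 3: add calibrated Gaussian noise and average over the batch
    batch_size = xb.shape[0]
    for p, s in zip(params, summed):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=s.shape)
        p.grad = (s + noise) / batch_size

    # Step 4 (privacy accounting) is not shown; it would track cumulative (epsilon, delta).
    optimizer.step()
    optimizer.zero_grad()
```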
The main challenge of DP-SGD is the privacy-utility tradeoff: stronger privacy protection (smaller \(\varepsilon\)) leads to more pronounced accuracy degradation. In the era of large models, DP-SGD also incurs high computational overhead due to the massive number of model parameters (requiring per-sample gradient computation).
Federated Learning
Federated learning is a distributed training paradigm that allows multiple parties to collaboratively train a model without sharing raw data. It has significant practical value in domains with stringent data privacy requirements, such as healthcare (multiple hospitals collaborating to train diagnostic models) and finance (multiple banks collaborating to train risk control models).
FedAvg Algorithm
Federated Averaging (FedAvg) (McMahan et al., 2017) is the most classic federated learning algorithm:
- The server distributes the global model parameters \(w_t\) to all (or a randomly selected subset of) clients
- Each client \(k\) trains on its local data \(D_k\) for \(E\) epochs, producing a local update \(w_t^k\)
- Clients upload model updates (not raw data) to the server
- The server aggregates all updates weighted by data volume: \(w_{t+1} = \sum_k \frac{|D_k|}{|D|} w_t^k\)
- The process repeats until convergence
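A minimal sketch of the server-side weighted aggregation step in PyTorch; client training and communication are abstracted away, and the state-dict interface is an assumption:

```python
def fedavg_aggregate(client_states: list, client_sizes: list) -> dict:
    """Weighted average of client model state_dicts, weights proportional to |D_k|."""
    total = sum(client_sizes)
    global_state = {}
    for name in client_states[0]:
        global_state[name] = sum(
            (size / total) * state[name].float()
            for state, size in zip(client_states, client_sizes)
        )
    return global_state

# Usage sketch for one communication round:
# client_states = [local_train(copy.deepcopy(global_model), data_k) for data_k in clients]
# global_model.load_state_dict(fedavg_aggregate(client_states, [len(d) for d in clients]))
```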
The core advantage of FedAvg is that it dramatically reduces communication overhead: clients run multiple epochs of local training between communication rounds, so model parameters are exchanged only once per round rather than after every gradient step.
The main challenges facing FedAvg include:
- Non-IID Data: Data distributions across clients can differ substantially (e.g., different hospitals have different disease compositions), causing inconsistent local update directions and making global model convergence difficult. Follow-up methods such as FedProx add a regularization term to constrain the degree to which local updates deviate from the global model, while SCAFFOLD corrects gradient drift through control variates.
- Communication Efficiency: Although FedAvg already reduces the number of communication rounds, transmitting complete model parameters remains a bottleneck for large models. Techniques such as gradient compression and model distillation can further reduce communication overhead.
- Device Heterogeneity: Clients vary greatly in computational power and network bandwidth, with "straggler" clients slowing down the entire training process. Asynchronous federated learning methods attempt to address this issue.
Privacy Guarantees and Limitations
Federated learning does not equate to privacy protection. Although raw data never leaves the local device, model updates (gradients) themselves can leak private information. Research has shown that:
- Gradient Inversion Attacks: Attackers can approximately reconstruct original training data from shared gradients through optimization methods. For example, Zhu et al.'s (2019) DLG method can recover training images at the pixel level from gradients under certain conditions.
- Membership Inference Attacks: Attackers can determine whether a specific data point was used in training.
Therefore, federated learning typically needs to be combined with differential privacy (adding noise to local gradients) or secure aggregation (using cryptographic protocols to ensure the server can only see the aggregated gradient, not individual client gradients) to provide stronger privacy guarantees.
Machine Unlearning
Background and Motivation
The EU's GDPR grants users the "Right to be Forgotten" (RTBF): users have the right to request that data controllers delete their personal data. When this right extends to the machine learning domain, the problem becomes complex — simply removing data from the training set is insufficient, because the model's parameters have already "memorized" information from that data.
The goal of machine unlearning is to make the model behave as if it had never seen the data requested for deletion. Ideally, the unlearned model should be indistinguishable from a model retrained from scratch on the dataset with the target data removed.
The most naive approach is to retrain the model from scratch after removing the data, but for large-scale models and datasets, this is computationally infeasible (imagine retraining a GPT-4-scale model from scratch). Researchers have therefore explored more efficient unlearning methods.
Exact Unlearning vs. Approximate Unlearning
- Exact Unlearning: The unlearned model is exactly equivalent to a model retrained from scratch on the remaining data. This is the ideal standard, providing the strongest privacy guarantees, but typically requires special design during the training phase (e.g., the SISA method).
- Approximate Unlearning: The unlearned model is statistically indistinguishable from a retrained model (similar to differential privacy's \((\varepsilon, \delta)\) guarantee). It allows some approximation error but dramatically reduces computational cost. Methods include parameter adjustment based on influence functions, gradient ascent-based "reverse learning," and knowledge distillation-based approaches.
SISA Training Method
SISA (Sharded, Isolated, Sliced, and Aggregated) (Bourtoule et al., 2021) is a representative exact unlearning method that reduces unlearning cost to an acceptable level through clever training architecture design:
- Sharding: Partition the training data into \(S\) non-overlapping subsets (shards)
- Isolated Training: Train a sub-model independently on each shard, with no interaction between sub-models
- Slicing: Further partition each shard sequentially into slices, incrementally incorporating new slices during training, and saving intermediate checkpoints after each slice is trained
- Aggregation: At inference time, aggregate predictions from all sub-models (e.g., majority voting or probability averaging)
When data needs to be unlearned, one simply locates the shard and slice containing that data and retrains that shard from the nearest checkpoint, rather than retraining the entire model. This reduces the computational cost of unlearning from \(O(N)\) to approximately \(O(N/S)\), where \(S\) is the number of shards. More shards mean faster unlearning, but since each sub-model has less training data, overall model accuracy may decrease — this is the accuracy-unlearning efficiency tradeoff inherent in the SISA method.
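A schematic of SISA's bookkeeping, assuming hypothetical helpers for training and checkpointing; the point is that a deletion request touches only one shard and retrains only from the last checkpoint before the affected slice:

```python
def sisa_unlearn(sample_id, shards, checkpoints, train_from):
    """Retrain only the shard/slice affected by a deletion request.

    shards:      shards[s] is a list of slices, each slice a list of sample ids
    checkpoints: checkpoints[s][i] is the sub-model saved after slice i of shard s
    train_from:  hypothetical helper that continues training a checkpoint on given slices
    """
    for s, slices in enumerate(shards):
        for i, data_slice in enumerate(slices):
            if sample_id in data_slice:
                data_slice.remove(sample_id)  # drop the sample from its slice
                start_model = checkpoints[s][i - 1] if i > 0 else None
                # Retrain shard s only, resuming from the checkpoint before slice i
                return s, train_from(start_model, slices[i:])
    raise KeyError(f"sample {sample_id} not found in any shard")

# Inference aggregates predictions from all sub-models (e.g. majority vote),
# so replacing one retrained sub-model leaves the others untouched.
```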
Verifying Machine Unlearning
How can one verify that "unlearning" has truly succeeded? Common verification methods include:
- Membership Inference Attacks: Launch membership inference attacks against the unlearned model to check whether the target data still "looks like training data." If the attack success rate approaches random guessing (50%), the unlearning is relatively thorough.
- Backdoor Verification: If the data to be unlearned contains specific patterns (e.g., backdoor triggers), check whether the unlearned model's response to those patterns has been eliminated.
- Distribution Comparison: Statistically compare the output distribution of the unlearned model with that of a model retrained from scratch (e.g., using KL divergence) — the smaller the difference, the closer the unlearning is to "exact."
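A minimal sketch of the loss-threshold flavor of membership inference used for such verification; the calibration on held-out data and the interpretation of the score are illustrative assumptions:

```python
import numpy as np

def membership_scores(model, loss_fn, samples, labels):
    """Per-sample loss: lower loss means the sample 'looks like' training data."""
    return np.array([float(loss_fn(model(x), y)) for x, y in zip(samples, labels)])

def unlearning_check(model, loss_fn, forgotten, heldout, labels_f, labels_h) -> float:
    """Fraction of forgotten samples whose loss falls below the median held-out loss.

    If unlearning succeeded, this should be close to 0.5 (random guessing): the
    forgotten data no longer looks any more 'memorized' than data never seen in training.
    """
    threshold = np.median(membership_scores(model, loss_fn, heldout, labels_h))
    scores = membership_scores(model, loss_fn, forgotten, labels_f)
    return float((scores < threshold).mean())
```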
Machine unlearning faces even greater challenges in the LLM era: large language models are typically trained on trillion-token-scale data at extremely high cost, and knowledge is distributed across parameters in a highly diffuse manner. How to reliably "unlearn" specific knowledge without retraining the entire model remains an open problem.