AI Safety & Trustworthiness

This section organizes AI safety through a trustworthy-AI lens. It covers risks across training, inference, deployment, and governance, including adversarial attacks, backdoors, privacy leakage, jailbreaks, system security, interpretability, alignment, and red teaming.

It reorganizes threat models, supporting figures, and engineering controls by canonical topic rather than by lecture order. For a fast overview, start with AI Safety Overview, then branch into the attack surface that matters most to you.

Section Map

1. Overview and framing

  • AI Safety Overview: the capability-risk-control-governance frame, unified threat models, and the reading map for the section

2. Model attacks and privacy

  • Adversarial Attack & Defense: white-box, black-box, and physical attacks, transferability, and major defenses
  • FGSM & PGD: mathematical foundations of gradient-based white-box attacks (see the FGSM/PGD sketch after this list)
  • Backdoor Attacks: training-time poisoning, trigger design, detection, and mitigation (see the poisoning sketch after this list)
  • Privacy Attacks: membership inference, model inversion, model extraction, differential privacy, and unlearning (see the membership-inference sketch after this list)
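
The gradient-based attacks above have compact reference formulations. Below is a minimal PyTorch sketch of FGSM and PGD for illustration; it assumes a differentiable classifier `model`, inputs scaled to [0, 1], and an L-infinity budget `eps`. It is a sketch of the standard formulations, not code from the linked pages.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Fast Gradient Sign Method: one step of size eps along
    sign(gradient of the loss with respect to the input)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

def pgd(model, x, y, eps, alpha, steps=10):
    """Projected Gradient Descent: iterated signed-gradient steps of size
    alpha, each projected back into the L-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project onto the eps-ball around the clean input, then the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```

A typical image-classification setting would be `eps=8/255`, `alpha=2/255`, `steps=10`; those values are conventional choices, not requirements.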
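
For Backdoor Attacks, here is a minimal sketch of BadNets-style training-time poisoning, assuming NCHW image batches in [0, 1]; the 3x3 corner patch, `poison_rate`, and `target_class` are illustrative assumptions, not details from the linked page.

```python
import torch

def poison_batch(images, labels, target_class=0, poison_rate=0.1, patch_value=1.0):
    """BadNets-style poisoning: stamp a 3x3 trigger in the bottom-right
    corner of a random fraction of images and relabel them to target_class."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(poison_rate * images.size(0))
    idx = torch.randperm(images.size(0))[:n_poison]
    images[idx, :, -3:, -3:] = patch_value  # trigger patch (assumes NCHW layout)
    labels[idx] = target_class
    return images, labels
```

A model trained on such batches learns the clean task normally but maps any input carrying the trigger to the target class.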
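
For Privacy Attacks, a minimal loss-threshold membership-inference sketch (in the spirit of Yeom et al.): examples on which the model's loss is unusually low are flagged as likely training members. The threshold `tau` is an assumption; in practice it would be calibrated on data known to be non-members.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_threshold_mia(model, x, y, tau):
    """Flag (x, y) pairs whose per-example loss falls below tau as likely
    training-set members; lower loss suggests memorization."""
    losses = F.cross_entropy(model(x), y, reduction="none")
    return losses < tau  # boolean membership predictions
```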

3. LLM and system security

  • LLM Jailbreaking: prompt- and system-level attacks that bypass model safety guardrails

4. Trustworthiness and governance

  • Red Teaming: systematically probing models for failures before and after deployment
  • AI Alignment: keeping model behavior consistent with human intent and values
  • AI Ethics & Governance: policy, regulation, and responsible-deployment practice
  • Explainability & Robustness: interpretability methods and robustness evaluation

Suggested reading order

  1. Start with AI Safety Overview to get the capability-risk-control-governance frame.
  2. Then move into Adversarial Attack & Defense, Backdoor Attacks, Privacy Attacks, or LLM Jailbreaking depending on the threat surface.
  3. Finish with Red Teaming, AI Alignment, AI Ethics & Governance, and Explainability & Robustness to connect research, deployment, and governance.
