AI Safety & Trustworthiness
This section organizes AI safety through a trustworthy-AI lens, covering risks across training, inference, deployment, and governance: adversarial attacks, backdoors, privacy leakage, jailbreaks, system security, interpretability, alignment, and red teaming. Threat models, figure evidence, and engineering controls are grouped by canonical knowledge topic rather than by lecture order. For a fast overview, start with AI Safety Overview, then branch into the attack surface that matters most to you.
Section Map
1. Overview and framing
- AI Safety Overview: the capability-risk-control-governance frame, unified threat models, and the reading map for the section
2. Model attacks and privacy
- Adversarial Attack & Defense: white-box, black-box, physical attacks, transferability, and major defenses
- FGSM & PGD: mathematical foundations of gradient-based white-box attacks
- Backdoor Attacks: training-time poisoning, trigger design, detection, and mitigation
- Privacy Attacks: membership inference, model inversion, model extraction, differential privacy, and unlearning
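As a taste of the gradient-based attacks covered under FGSM & PGD, the one-step FGSM update x' = x + ε · sign(∇ₓL) can be sketched in a few lines. The model here is a deliberately toy stand-in (a logistic-regression loss with an analytic input gradient); the function names and example values are illustrative, not from any page in this section:

```python
import numpy as np

def fgsm(x, grad_x, epsilon):
    """Fast Gradient Sign Method: one-step perturbation x' = x + eps * sign(dL/dx)."""
    return x + epsilon * np.sign(grad_x)

def input_gradient(w, x, y):
    """Analytic dL/dx for the toy logistic loss L = -log sigmoid(y * w.x)."""
    margin = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(margin))

# Toy linear model and input (illustrative values only).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
y = 1.0

x_adv = fgsm(x, input_gradient(w, x, y), epsilon=0.1)
# Each coordinate of x moves by exactly +/- epsilon in the loss-increasing direction.
```

PGD, discussed on the same page, simply iterates this step with a projection back into the ε-ball around the original input after each update.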
3. LLM and system security
- LLM Jailbreaking: prompt-based, token-based, and multi-turn jailbreaks plus layered defenses
- Visual Instruction Injection: malicious instructions hidden in multimodal inputs
- Red Teaming: systematic security evaluation, dangerous capability evals, and regression testing
- For engineering deployment, see AI Engineering Safety & Governance and LLM & Agent System Security
4. Trustworthiness and governance
- AI Ethics & Governance: responsibility, fairness, transparency, regulation, and institutional governance
- AI Alignment: RLHF, Constitutional AI, reward hacking, and scalable oversight
- Explainability & Robustness: XAI, Grad-CAM, LIME, mechanistic interpretability, and trustworthiness evidence
- Authenticity & Privacy Protection: provenance, authenticity, and privacy-preserving deployment
Suggested Reading Order
- Start with AI Safety Overview to get the capability-risk-control-governance frame.
- Then move into Adversarial Attack & Defense, Backdoor Attacks, Privacy Attacks, or LLM Jailbreaking depending on the threat surface.
- Finish with Red Teaming, AI Alignment, AI Ethics & Governance, and Explainability & Robustness to connect research, deployment, and governance.